Posted to java-user@lucene.apache.org by ra...@barclays.com on 2013/06/17 23:03:07 UTC

New Lucene User

Hi,

I need to implement full-text search in a new application. I came across Lucene and want to check whether it fits our needs.

Requirement:

I have a SQL Server database table with around 70 million records in it. It is not a live table; data is appended to it once a day.

The table has about 30 columns. The user will provide one string, and this value has to be searched against 20 columns for each record. All matching records need to be displayed in the UI.

My Analysis

Based on what I have read about Lucene so far, I believe I need to convert my database table data into a flat file, generate indexes, and then perform the search.

Questions


- To begin with, is Lucene a good option for this kind of requirement? Note: Let us ignore daily index generation and UI display for this discussion.

- Should the entire data of 70 million records exist in one flat file?

- How do I define what fields (20 columns) should be searched among the complete list (30 columns)?

As I am just starting off, I may not even know about other dependencies. I would appreciate clarifications or a pointer to an example that suits my case.

Please let me know if you have any questions.

Thanks,
Raghu


_______________________________________________

This message is for information purposes only, it is not a recommendation, advice, offer or solicitation to buy or sell a product or service nor an official confirmation of any transaction. It is directed at persons who are professionals and is not intended for retail customer use. Intended for recipient only. This message is subject to the terms at: www.barclays.com/emaildisclaimer.

For important disclosures, please see: www.barclays.com/salesandtradingdisclaimer regarding market commentary from Barclays Sales and/or Trading, who are active market participants; and in respect of Barclays Research, including disclosures relating to specific issuers, please see http://publicresearch.barclays.com.

_______________________________________________

RE: New Lucene User

Posted by ra...@barclays.com.
Ashwin,

Thank you very much for your suggestions. I will take a look at Solr as well.

Regards,
Raghu



-----Original Message-----
From: Ashwin Tandel [mailto:ashwintandel@gmail.com] 
Sent: Tuesday, June 18, 2013 6:29 PM
To: java-user@lucene.apache.org
Subject: Re: New Lucene User

Raghav,


I would like to second Jack. Solr would take care of indexing your documents without your writing any code, and if required it has scalability features such as replication and sharding that can handle a large volume of data.

http://lucene.apache.org/solr/

Regards,
Ashwin


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: New Lucene User

Posted by Ashwin Tandel <as...@gmail.com>.
Raghav,


I would like to second Jack. Solr would take care of indexing your documents
without your writing any code, and if required it has scalability features
such as replication and sharding that can handle a large volume of data.

http://lucene.apache.org/solr/

Regards,
Ashwin


On Tue, Jun 18, 2013 at 8:38 AM, <ra...@barclays.com> wrote:

> Heikki,
>
> Thank you very much. I tried it out and the initial results look good.
>
> Although I get "java.lang.OutOfMemoryError: Java heap space" when I search
> for a single TextField over 70 million records. Probably my code needs
> tuning.
>
> I'll research more to figure it out. But this is a great start, thanks to
> everyone who provided suggestions.
>
> Regards,
> Raghu

RE: New Lucene User

Posted by ra...@barclays.com.
Heikki,

Thank you very much. I tried it out and the initial results look good.

However, I get "java.lang.OutOfMemoryError: Java heap space" when I search a single TextField across the 70 million records. My code probably needs tuning.

I'll research more to figure it out. But this is a great start, thanks to everyone who provided suggestions.

Regards,
Raghu


-----Original Message-----
From: heikki [mailto:tropicano@gmail.com] 
Sent: Monday, June 17, 2013 5:35 PM
To: java-user@lucene.apache.org
Subject: Re: New Lucene User

hi,

I think Lucene is an excellent option for you.

You don't need to export the data to a flat file first. You can read it straight from your database in whatever way you normally do, e.g. using JDBC or Hibernate. You can do this, for example, once a day, retrieving only modified records. For each record you retrieve, you create a Lucene Document. You add fields to these documents as you see fit: since you want to search in 20 of your 30 columns, you add fields containing the values from those 20 columns to the Lucene Document.
You give each Document to an IndexWriter, which adds it to the Lucene index. When you search, you get such Documents back, which you can then use to build the UI display of the search results.

Of course there's a lot more to say about this and I'd recommend you check online tutorials or one of the Lucene books like *Lucene In Action* to learn more about how to use Lucene in detail.

Kind regards
Heikki Doeleman


Re: New Lucene User

Posted by heikki <tr...@gmail.com>.
hi,

I think Lucene is an excellent option for you.

You don't need to export the data to a flat file first. You can read it
straight from your database in whatever way you normally do, e.g. using
JDBC or Hibernate. You can do this, for example, once a day, retrieving
only modified records. For each record you retrieve, you create a Lucene
Document. You add fields to these documents as you see fit: since you want
to search in 20 of your 30 columns, you add fields containing the values
from those 20 columns to the Lucene Document. You give each Document to an
IndexWriter, which adds it to the Lucene index. When you search, you get
such Documents back, which you can then use to build the UI display of the
search results.
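
The flow described above (one document per database row, fields only for the
searchable columns, a daily pass that picks up what is new) can be sketched in
plain Java. This is not Lucene's API: the class RowIndexSketch, its methods and
the column names are hypothetical, and the in-memory map only stands in for
what a Document and an IndexWriter do in a real index. The daily pass also
assumes that rows carry an ever-increasing numeric ID, which this thread does
not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java stand-in for a search index; every name here is hypothetical.
class RowIndexSketch {
    // Only columns listed here become searchable fields.
    private final Set<String> searchableColumns;
    // term -> sorted IDs of the rows that contain it in a searchable column
    private final Map<String, TreeSet<Long>> postings = new HashMap<>();
    // Highest row ID indexed so far (assumes IDs only grow as rows are appended).
    private long lastIndexedId = 0;

    RowIndexSketch(Set<String> searchableColumns) {
        this.searchableColumns = new HashSet<>(searchableColumns);
    }

    // Builds a row from alternating column names and values.
    static Map<String, String> row(String... namesAndValues) {
        Map<String, String> row = new HashMap<>();
        for (int i = 0; i + 1 < namesAndValues.length; i += 2) {
            row.put(namesAndValues[i], namesAndValues[i + 1]);
        }
        return row;
    }

    // Indexes one row, skipping every column that is not searchable.
    void addRow(long id, Map<String, String> row) {
        for (Map.Entry<String, String> column : row.entrySet()) {
            if (!searchableColumns.contains(column.getKey()) || column.getValue() == null) {
                continue;
            }
            // Crude analysis: lower-case, then split on anything but letters and digits.
            String text = column.getValue().toLowerCase(Locale.ROOT);
            for (String term : text.split("[^\\p{L}\\p{N}]+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
                }
            }
        }
        lastIndexedId = Math.max(lastIndexedId, id);
    }

    // Daily pass: indexes only rows with an ID above the previous high-water mark.
    int addNewRows(Map<Long, Map<String, String>> table) {
        long highWaterMark = lastIndexedId;
        int added = 0;
        for (Map.Entry<Long, Map<String, String>> entry : table.entrySet()) {
            if (entry.getKey() > highWaterMark) {
                addRow(entry.getKey(), entry.getValue());
                added++;
            }
        }
        return added;
    }

    // Returns the IDs of all rows containing the term in any searchable column.
    List<Long> search(String term) {
        List<Long> result = new ArrayList<>();
        TreeSet<Long> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids != null) {
            result.addAll(ids);
        }
        return result;
    }

    public static void main(String[] args) {
        RowIndexSketch index = new RowIndexSketch(new HashSet<>(Arrays.asList("name", "city")));
        Map<Long, Map<String, String>> table = new HashMap<>();
        table.put(1L, row("name", "Acme Trading", "city", "London"));
        table.put(2L, row("name", "London Bakery", "city", "Leeds"));
        System.out.println("indexed: " + index.addNewRows(table));
        // Row 3 mentions "london" only in a column that is not searchable.
        table.put(3L, row("name", "Nordic Freight", "city", "Oslo", "internal_code", "london-x"));
        System.out.println("indexed: " + index.addNewRows(table));
        System.out.println("london -> " + index.search("london"));
    }
}
```

With Lucene itself, addRow would build a Document and hand it to the
IndexWriter, and the rows would come from a JDBC query rather than from a map.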

Of course there's a lot more to say about this and I'd recommend you check
online tutorials or one of the Lucene books like *Lucene In Action* to
learn more about how to use Lucene in detail.

Kind regards
Heikki Doeleman



Re: New Lucene User

Posted by Jack Krupansky <ja...@basetechnology.com>.
Try starting with Solr. You can have your search server up and running 
without writing any code. And Solr's Data Import Handler can load data 
directly from the database.

-- Jack Krupansky

