Posted to user@lucenenet.apache.org by Eric Advincula <Er...@co.mohave.az.us> on 2009/10/30 21:17:26 UTC

Best way to store book information

I have countless articles in HTML pages, and I'm importing them and parsing out only the text for searching.  My question is: what is the best way to store the "Content"?
 
    doc = new Document();
    doc.Add(new Field("Title", title, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field("File", page, Field.Store.YES, Field.Index.UN_TOKENIZED));

    content = ParseHTML(file);
    doc.Add(new Field("Content", content.Trim(), Field.Store.YES, Field.Index.TOKENIZED));
    writer.AddDocument(doc);
 
I'm only searching the "Content" field, not the other two.  So my questions are:

1.  Should I add term vectors when I save it?  If so, which option:
     Yes, With_Positions, With_Offsets, With_Position_Offsets
2.  Should I add boosting to this field?
3.  What is the best way to search the content?  Something like what happens when you type a query into Google?
 
Thanks

RE: Best way to store book information

Posted by Digy <di...@gmail.com>.

> How does that work though if I don't set "Content" to Store.YES?  If I
> don't store it then I can't search on it, can I?  What I do is search the
> "Content" and use the "Title" and "File" to retrieve the actual HTML page,
> which is in a directory path.  Can I still search in the "Content" if I
> don't store it?

YES. 

> Should I use Vectors also when storing?  If so which one?
NO NEED. 

> Will TermEnum work for searching like "SQL Server database tuning" as a
> search?
> Do you happen to have an example on doing a search using TermEnum?

I have no idea about "SQL Server database tuning". But TermEnum can be
used to show the alternatives while the user is typing a word (if that is
what you are asking). See the discussion "Alternative to looping through Hits".

DIGY
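
A minimal sketch of the "show alternatives while typing" use of TermEnum
described above, assuming the Lucene.Net 2.4 API (IndexReader.Terms,
TermEnum.Term, TermEnum.Next). The Suggest and BuildIndex helpers, the sample
text, and the in-memory RAMDirectory are hypothetical and only for
illustration; later Lucene.Net releases changed several of these signatures.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class TermEnumSketch
{
    // Hypothetical helper: builds a tiny in-memory index for the demo.
    public static RAMDirectory BuildIndex()
    {
        RAMDirectory dir = new RAMDirectory();
        // Three-argument constructor as found in the 2.4 line (assumption:
        // newer releases require an extra MaxFieldLength argument).
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        foreach (string text in new string[] { "SQL Server database tuning", "Server tuning basics" })
        {
            Document doc = new Document();
            doc.Add(new Field("Content", text, Field.Store.NO, Field.Index.TOKENIZED));
            writer.AddDocument(doc);
        }
        writer.Close();
        return dir;
    }

    // Hypothetical helper: returns up to max indexed terms of the given
    // field that start with the typed prefix.
    public static List<string> Suggest(IndexReader reader, string field, string prefix, int max)
    {
        List<string> result = new List<string>();
        // StandardAnalyzer lower-cases terms at index time, so match in lower case.
        prefix = prefix.ToLowerInvariant();
        // Terms(Term) positions the enumeration at the first term >= the given one.
        TermEnum terms = reader.Terms(new Term(field, prefix));
        try
        {
            do
            {
                Term t = terms.Term();
                // Stop once we run out of terms, leave the field, or leave the prefix range.
                if (t == null || t.Field() != field ||
                    !t.Text().StartsWith(prefix, StringComparison.Ordinal))
                {
                    break;
                }
                result.Add(t.Text());
            } while (result.Count < max && terms.Next());
        }
        finally
        {
            terms.Close();
        }
        return result;
    }

    public static void Main()
    {
        IndexReader reader = IndexReader.Open(BuildIndex());
        foreach (string term in Suggest(reader, "Content", "S", 10))
        {
            Console.WriteLine(term);
        }
        reader.Close();
    }
}
```

This only enumerates single indexed terms, which is why it suits word
completion rather than searching for a whole phrase.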





RE: Best way to store book information

Posted by Eric Advincula <Er...@co.mohave.az.us>.

How does that work though if I don't set "Content" to Store.YES?  If I don't store it then I can't search on it, can I?  What I do is search the "Content" and use the "Title" and "File" to retrieve the actual HTML page, which is in a directory path.  Can I still search in the "Content" if I don't store it?

Should I use term vectors also when storing?  If so, which one?

Will TermEnum work for searching like "SQL Server database tuning" as a search?
Do you happen to have an example on doing a search using TermEnum?



RE: Excessive IOExceptions in IndexSearcher/QueryParser/FastCharStream?

Posted by Digy <di...@gmail.com>.

It is expected behaviour inherited from Lucene.Java, and I haven't seen a
(remarkable) performance degradation because of it.

DIGY.


Re: Excessive IOExceptions in IndexSearcher/QueryParser/FastCharStream?

Posted by Ron Grabowski <ro...@yahoo.com>.

I'm using https://svn.apache.org/repos/asf/incubator/lucene.net/tags/Lucene.Net_2_4_0/src/Lucene.Net.


Excessive IOExceptions in IndexSearcher/QueryParser/FastCharStream?

Posted by Ron Grabowski <ro...@yahoo.com>.

I was profiling my search code and saw an awful lot of Exceptions being thrown for simple usages of QueryParser:

 http://www.ronosaurus.com/lucene/indexsearcher_queryparser_ioexception.png

For example this code produces 2 IOExceptions in FastCharStream (line 25):

 QueryParser parser = new QueryParser("name", new StandardAnalyzer());
 parser.Parse("produce");

Is that normal? In my screenshot there's close to 100 Exceptions within 15 seconds of running some threaded searches.  Would the FastCharStream be faster if it didn't throw so many Exceptions? I tried hacking CanRead() into CharStream but didn't get very far.

RE:

Posted by Digy <di...@gmail.com>.

No. Use the regular IndexSearcher search function with a query such as
"search that phrase" (use quotation marks so it is treated as a phrase).

DIGY
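
A minimal sketch of the quoted-phrase search suggested above, assuming the
Lucene.Net 2.4 API (QueryParser.Parse, IndexSearcher.Search returning Hits).
The BuildIndex and CountHits helpers and the two sample pages are hypothetical;
Hits was deprecated and later removed, so newer releases need a different
search call.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class PhraseSearchSketch
{
    // Hypothetical helper: indexes two sample pages in memory.
    public static RAMDirectory BuildIndex()
    {
        RAMDirectory dir = new RAMDirectory();
        // Three-argument constructor as found in the 2.4 line (assumption:
        // newer releases require an extra MaxFieldLength argument).
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        string[] pages = {
            "An introduction to SQL Server database tuning",
            "Tuning a database on SQL Server"
        };
        foreach (string text in pages)
        {
            Document doc = new Document();
            doc.Add(new Field("Content", text, Field.Store.NO, Field.Index.TOKENIZED));
            writer.AddDocument(doc);
        }
        writer.Close();
        return dir;
    }

    // Hypothetical helper: parses the user's text against "Content" and
    // returns the number of matching documents.
    public static int CountHits(IndexSearcher searcher, string queryText)
    {
        QueryParser parser = new QueryParser("Content", new StandardAnalyzer());
        Query query = parser.Parse(queryText);
        Hits hits = searcher.Search(query);
        return hits.Length();
    }

    public static void Main()
    {
        IndexSearcher searcher = new IndexSearcher(BuildIndex());

        // Quotation marks inside the query text ask the parser for a phrase query:
        // the words must appear next to each other, in this order.
        Console.WriteLine(CountHits(searcher, "\"SQL Server database tuning\""));

        // Without quotation marks the words are searched independently.
        Console.WriteLine(CountHits(searcher, "SQL Server database tuning"));

        searcher.Close();
    }
}
```

With these sample pages the quoted form should match only the first page,
while the unquoted form should match both.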


Re:

Posted by Eric Advincula <Er...@co.mohave.az.us>.

Thanks.  What I meant by "SQL Server database tuning" was typing that in as a phrase I want to search for.  I would like to search on entire phrases, not just single words.  Would TermEnum still work?


RE: Best way to store book information

Posted by Digy <di...@gmail.com>.

1. If you want to return the field's content to the user, use
"Store.YES"; otherwise there is no need to store it.
In your case, "Content" can be "Store.NO", since the whole HTML doc is rarely
returned to the user.
2. If you want to give some "priority" to a specific field/term, use
boosting. For example, HTML pages thought to be important can be boosted.
3. Use TermEnum

DIGY
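
A minimal sketch of point 1, assuming the Lucene.Net 2.4 API used elsewhere in
this thread: "Content" is indexed with Store.NO and is still searchable, while
the stored "Title" and "File" values come back with each hit. The BuildIndex
helper, the sample values, and the in-memory RAMDirectory are hypothetical.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class StoreNoSketch
{
    // Hypothetical helper: indexes one sample page in memory.
    public static RAMDirectory BuildIndex()
    {
        RAMDirectory dir = new RAMDirectory();
        // Three-argument constructor as found in the 2.4 line (assumption:
        // newer releases require an extra MaxFieldLength argument).
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        // Stored but not tokenized: returned with hits, matched only as exact values.
        doc.Add(new Field("Title", "Tuning guide", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.Add(new Field("File", "articles/tuning.html", Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Tokenized but not stored: searchable, yet the text is not kept in the index.
        doc.Add(new Field("Content", "SQL Server database tuning tips",
                          Field.Store.NO, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();
        return dir;
    }

    public static void Main()
    {
        IndexSearcher searcher = new IndexSearcher(BuildIndex());
        Query query = new QueryParser("Content", new StandardAnalyzer()).Parse("tuning");
        Hits hits = searcher.Search(query);

        for (int i = 0; i < hits.Length(); i++)
        {
            Document hit = hits.Doc(i);
            // Stored fields are available for locating the original HTML page.
            Console.WriteLine(hit.Get("Title") + " -> " + hit.Get("File"));
            // Get returns null for a field that was not stored.
            Console.WriteLine("Content stored: " + (hit.Get("Content") != null));
        }
        searcher.Close();
    }
}
```

Leaving the page text out of the index keeps the index smaller; the "File"
value is enough to load the original page from disk when a hit is shown.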

