Posted to java-user@lucene.apache.org by Hetan Shah <He...@Sun.COM> on 2005/01/06 00:31:27 UTC
Indexing flat files with out .txt extension
Hello,
How can one index simple text files without the .txt extension? I am
trying to use IndexFiles and IndexHTML, but not to my satisfaction.
With IndexFiles I do not get any control over the content of the file,
and with IndexHTML the files without any extension do not get indexed
at all. Any pointers are really appreciated.
Thanks.
-H
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Span Query Performance
Posted by Paul Elschot <pa...@xs4all.nl>.
Sorry for the duplicate on lucene-dev, it should have gone to lucene-user
directly:
A bit more:
On Thursday 06 January 2005 10:22, Paul Elschot wrote:
> On Thursday 06 January 2005 02:17, Andrew Cunningham wrote:
> > Hi all,
> >
> > I'm currently doing a query similar to the following:
> >
> > for w in wordset:
> >     query = w near (word1 V word2 V word3 ... V word1422);
> >     perform query
> >
> > and I am doing this through SpanQuery.getSpans(), iterating through the
> > spans and counting the matches, which can result in 4782282 matches
> > (essentially I am only after the match count).
> > The query works but the performance can be somewhat slow; so I am
> > wondering:
> >
...
> > c) Is there a faster method than what I am doing that I should consider?
>
> Preindexing all word combinations that you're interested in.
>
In case you know all the words in advance, you could also index a
helper word at the same position as each of those words.
This requires a custom analyzer that inserts the helper word in the
token stream with a zero position increment.
The query then simplifies to:
query = w near helperword
which would probably speed things up significantly.
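A minimal sketch of that idea in plain Java, not tied to any Lucene version. The names Tok, HelperWordInjector and the term "helperword" are made up for this illustration; in a real analyzer the same logic would presumably live in a TokenFilter that emits the extra token with its position increment set to 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-Java stand-in for an analyzer token. This is NOT the Lucene Token class.
final class Tok {
    final String text;
    final int positionIncrement; // 1 = next position, 0 = same position as the previous token

    Tok(String text, int positionIncrement) {
        this.text = text;
        this.positionIncrement = positionIncrement;
    }
}

final class HelperWordInjector {
    // Hypothetical marker term; any term that cannot occur in real text would do.
    static final String HELPER = "helperword";

    // Copies the input stream and, after every token whose text is in `targets`,
    // adds a helper token with a zero position increment so both share one position.
    static List<Tok> inject(List<Tok> input, Set<String> targets) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : input) {
            out.add(t);
            if (targets.contains(t.text)) {
                out.add(new Tok(HELPER, 0));
            }
        }
        return out;
    }

    // Absolute position of each token: the running sum of increments, starting at -1
    // so that the first token with increment 1 lands on position 0.
    static List<Integer> positions(List<Tok> tokens) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (Tok t : tokens) {
            position += t.positionIncrement;
            result.add(position);
        }
        return result;
    }
}
```

With the helper term indexed this way, a near query against the single helper term matches the same positions as a near query against any word in the set, which is why the large disjunction can be dropped.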
Regards,
Paul Elschot
Re: Span Query Performance
Posted by Paul Elschot <pa...@xs4all.nl>.
On Thursday 06 January 2005 02:17, Andrew Cunningham wrote:
> Hi all,
>
> I'm currently doing a query similar to the following:
>
> for w in wordset:
>     query = w near (word1 V word2 V word3 ... V word1422);
>     perform query
>
> and I am doing this through SpanQuery.getSpans(), iterating through the
> spans and counting the matches, which can result in 4782282 matches
> (essentially I am only after the match count).
> The query works but the performance can be somewhat slow; so I am wondering:
>
> a) Would the query potentially run faster if I used
> Searcher.search(query) with a custom similarity,
> or do both methods essentially use the same mechanics?
It would be somewhat slower, because search() also loops over getSpans()
and additionally computes document scores and constructs a Hits from them.
> b) Does using a RAMDirectory improve query performance by any significant
> amount?
That depends on your operating system, the size of the index, the amount
of RAM you can use, the file buffering efficiency, other loads on the
computer ...
> c) Is there a faster method than what I am doing that I should consider?
Preindexing all word combinations that you're interested in.
Regards,
Paul Elschot
Span Query Performance
Posted by Andrew Cunningham <cu...@csiro.au>.
Hi all,
I'm currently doing a query similar to the following:
for w in wordset:
    query = w near (word1 V word2 V word3 ... V word1422);
    perform query
and I am doing this through SpanQuery.getSpans(), iterating through the
spans and counting the matches, which can result in 4782282 matches
(essentially I am only after the match count).
The query works but the performance can be somewhat slow; so I am wondering:
a) Would the query potentially run faster if I used
Searcher.search(query) with a custom similarity,
or do both methods essentially use the same mechanics?
b) Does using a RAMDirectory improve query performance by any significant
amount?
c) Is there a faster method than what I am doing that I should consider?
Thanks,
Andrew
Re: Indexing flat files with out .txt extension
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 11, 2005, at 7:28 PM, Hetan Shah wrote:
> Thanks for the pointers, I have modified the Indexer.java to index the
> files from the directory by removing the file extension check of
> (".txt"). Now I do get the index from the files.
...
>
> java org.apache.lucene.demo.SearchFiles
The problem is you're using the SearchFiles demo code, which uses
different field names than Indexer.java. You need to be sure the
searching and indexing code agree on the field names. Since you
borrowed from LIA's Indexer.java, keep borrowing from its Searcher.java.
You can run "ant Searcher" from the LIA source code.
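To make the mismatch concrete, here is a small plain-Java model of it. It does not use Lucene at all: a Map stands in for a document's stored fields, and the field names "filename" and "path" are assumptions chosen for illustration (the point is only that the indexing and searching sides use different names). The fallback message is the one shown in the output above.

```java
import java.util.HashMap;
import java.util.Map;

final class FieldNameMismatch {
    // Stand-in for indexing one file: stores its path under the given field name.
    // The real field name used by any particular indexer is an assumption here.
    static Map<String, String> indexFile(String fieldName, String canonicalPath) {
        Map<String, String> storedFields = new HashMap<>();
        storedFields.put(fieldName, canonicalPath);
        return storedFields;
    }

    // Stand-in for the search side: looks for a path, then a URL, and falls back
    // to a message when neither field is stored under the expected name.
    static String describe(Map<String, String> storedFields, String pathField) {
        String path = storedFields.get(pathField);
        if (path != null) {
            return path;
        }
        String url = storedFields.get("url");
        if (url != null) {
            return url;
        }
        return "No path nor URL for this document";
    }
}
```

Indexing under "filename" and reading "path" gives the fallback message for every hit, even though the documents are found; reading the same name that was written returns the path.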
Be sure to really learn what's going on in that code rather than just
accepting what it's doing - this will pay off as you continue to evolve
your application. Indexer.java has only 6 (effective) lines of code
tied to Lucene's API, and similarly very few lines of Lucene-dependent
code in Searcher.java. All of this is demo code, and is designed to be
adapted to your needs.
Erik
Re: Indexing flat files with out .txt extension
Posted by Hetan Shah <He...@Sun.COM>.
Hi Erik,
Thanks for the pointers. I have modified Indexer.java to index the
files from the directory by removing the file extension check of
(".txt"). Now I do get the index from the files.
The new situation is that when I run SearchFiles:
java org.apache.lucene.demo.SearchFiles
Query: tty
Searching for: tty
3 total matching documents
0. No path nor URL for this document
1. No path nor URL for this document
2. No path nor URL for this document
I do not get the actual path from the index, and using Luke I get the
three hits. The last two are from the index and not the real documents.
Any idea what is happening and how I can fix it?
Thanks.
-H
Erik Hatcher wrote:
> On Jan 10, 2005, at 7:06 PM, Hetan Shah wrote:
>
>>Got the latest Ant and got the demo to work. I am, however, not sure
>>in which part of the source code the indexing for different file
>>types is done, say for example .html, .txt and such?
>
>
> Your best bet is to dig around in the codebase. The Indexer.java code
> is hard-coded to only do .txt file extensions - this was on purpose as
> the first example in the book, figuring that someone running this code
> on their C:\ drive would find it relatively safe and fast to run.
>
> There is also an example easily run from the Ant launcher to show how
> various document types can be handled using an extensible framework.
> Run "ant ExtensionFileHandler". It doesn't actually index the document
> it creates, but displays it to the console. It would be pretty trivial
> to pair the Indexer.java code up with the file handler framework to
> crawl a directory tree and index any content it recognizes.
>
>
>>Appreciate your help. If you have any sample code would certainly
>>appreciate that also.
>
>
> You got all the code already. It should be fairly straightforward to
> navigate the src tree, especially with the Table of Contents handy:
>
> http://www.lucenebook.com/toc
>
> (incidentally, this dynamic TOC page is blending the blog content with
> the TOC using an IndexReader to find all blog entries that refer to
> each section - and you'll see the two, minor and cosmetic, errata
> listed there already).
>
> Erik
>
>
Re: Indexing flat files with out .txt extension
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 10, 2005, at 7:06 PM, Hetan Shah wrote:
> Got the latest Ant and got the demo to work. I am, however, not sure
> in which part of the source code the indexing for different file
> types is done, say for example .html, .txt and such?
Your best bet is to dig around in the codebase. The Indexer.java code
is hard-coded to only do .txt file extensions - this was on purpose as
the first example in the book, figuring that someone running this code
on their C:\ drive would find it relatively safe and fast to run.
There is also an example easily run from the Ant launcher to show how
various document types can be handled using an extensible framework.
Run "ant ExtensionFileHandler". It doesn't actually index the document
it creates, but displays it to the console. It would be pretty trivial
to pair the Indexer.java code up with the file handler framework to
crawl a directory tree and index any content it recognizes.
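As a rough sketch of the crawling half of that pairing, the following plain-Java code walks a directory tree and keeps every regular file that looks like text, with no check on the file extension. It is not taken from the book's code: PlainTextCrawler and its "no NUL byte in the first 4096 bytes" heuristic are assumptions made for this example, and the actual indexing call is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

final class PlainTextCrawler {
    // Heuristic assumed for this sketch: a file is treated as text when the first
    // bytes read from it contain no NUL byte. Empty files count as text.
    static boolean looksLikeText(Path file) throws IOException {
        byte[] buffer = new byte[4096];
        try (InputStream in = Files.newInputStream(file)) {
            int bytesRead = in.read(buffer);
            for (int i = 0; i < bytesRead; i++) {
                if (buffer[i] == 0) {
                    return false;
                }
            }
        }
        return true;
    }

    // Walks the tree under `root` and returns every regular file that passes the
    // text heuristic, regardless of its name or extension.
    static List<Path> collect(Path root) throws IOException {
        List<Path> result = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isRegularFile(p) && looksLikeText(p)) {
                    result.add(p);
                }
            }
        }
        return result;
    }
}
```

Each path returned by collect() could then be handed to whatever indexing code is in use, so the decision about what to index rests on file content rather than on the file name.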
> Appreciate your help. If you have any sample code would certainly
> appreciate that also.
You got all the code already. It should be fairly straightforward to
navigate the src tree, especially with the Table of Contents handy:
http://www.lucenebook.com/toc
(incidentally, this dynamic TOC page is blending the blog content with
the TOC using an IndexReader to find all blog entries that refer to
each section - and you'll see the two, minor and cosmetic, errata
listed there already).
Erik
Re: Indexing flat files with out .txt extension
Posted by Hetan Shah <He...@Sun.COM>.
Hi Erik,
Got the latest Ant and got the demo to work. I am, however, not sure in
which part of the source code the indexing for different file types is
done, say for example .html, .txt and such. From there I can work out
how to index a plain text file which does not have any extension.
Appreciate your help. If you have any sample code would certainly
appreciate that also.
-H.
Erik Hatcher wrote:
> On Jan 6, 2005, at 6:49 PM, Hetan Shah wrote:
>
>> Hi Erik,
>>
>> I got the source downloaded and unpacked. I am having difficulty in
>> building any of the modules. Maybe something's wrong with my Ant
>> installation.
>> ************************
>> LuceneInAction% ant test
>> Buildfile: build.xml
>>
>> BUILD FAILED
>> file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element
>> "available"
>
>
> The good ol' README says this:
>
> R E Q U I R E M E N T S
> -----------------------
> * JDK 1.4+
> * Ant 1.6+ (to run the automated examples)
> * JUnit 3.8.1+
> - junit.jar should be in ANT_HOME/lib
>
> You are not running Ant 1.6, I'm sure. Upgrade your version of Ant,
> and of course follow the rest of the README and all should be well.
>
> Erik
>
>
Re: Indexing flat files with out .txt extension
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 6, 2005, at 6:49 PM, Hetan Shah wrote:
> Hi Erik,
>
> I got the source downloaded and unpacked. I am having difficulty in
> building any of the modules. Maybe something's wrong with my Ant
> installation.
> ************************
> LuceneInAction% ant test
> Buildfile: build.xml
>
> BUILD FAILED
> file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element
> "available"
The good ol' README says this:
R E Q U I R E M E N T S
-----------------------
* JDK 1.4+
* Ant 1.6+ (to run the automated examples)
* JUnit 3.8.1+
- junit.jar should be in ANT_HOME/lib
You are not running Ant 1.6, I'm sure. Upgrade your version of Ant,
and of course follow the rest of the README and all should be well.
Erik
Re: Indexing flat files with out .txt extension
Posted by Hetan Shah <He...@Sun.COM>.
Hi Erik,
I got the source downloaded and unpacked. I am having difficulty in
building any of the modules. Maybe something's wrong with my Ant
installation.
************************
LuceneInAction% ant test
Buildfile: build.xml
BUILD FAILED
file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element
"available"
Total time: 5 seconds
LuceneInAction% ant Indexer
Buildfile: build.xml
BUILD FAILED
file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element
"available"
Total time: 5 seconds
**********************
Can you point me to the proper module for creating my own indexer? I tried
looking into the indexing module but was not sure.
TIA,
-H
Erik Hatcher wrote:
>
> On Jan 5, 2005, at 6:31 PM, Hetan Shah wrote:
>
>> How can one index simple text files without the .txt extension? I am
>> trying to use IndexFiles and IndexHTML, but not to my
>> satisfaction. With IndexFiles I do not get any control over the
>> content of the file, and with IndexHTML the files without any
>> extension do not get indexed at all. Any pointers are really
>> appreciated.
>
>
> Try out the Indexer code from Lucene in Action. You can download it
> from the link here:
> http://www.lucenebook.com/blog/announcements/sourcecode.html
>
> It'll be cleaner to follow and borrow from. The code that ships with
> Lucene is for demonstration purposes. It surprises me how often folks
> use that code to build real indexes. It's quite straightforward to
> create your own Java code to do the indexing in whatever manner you
> like, borrowing from examples.
>
> When you get the download unpacked, simply run "ant Indexer" to see it
> in action. And then "ant Searcher" to search the index just built.
>
> Erik
>
>
Re: Indexing flat files with out .txt extension
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 5, 2005, at 6:31 PM, Hetan Shah wrote:
> How can one index simple text files without the .txt extension? I am
> trying to use IndexFiles and IndexHTML, but not to my satisfaction.
> With IndexFiles I do not get any control over the content of the
> file, and with IndexHTML the files without any extension do not
> get indexed at all. Any pointers are really appreciated.
Try out the Indexer code from Lucene in Action. You can download it
from the link here:
http://www.lucenebook.com/blog/announcements/sourcecode.html
It'll be cleaner to follow and borrow from. The code that ships with
Lucene is for demonstration purposes. It surprises me how often folks
use that code to build real indexes. It's quite straightforward to
create your own Java code to do the indexing in whatever manner you
like, borrowing from examples.
When you get the download unpacked, simply run "ant Indexer" to see it
in action. And then "ant Searcher" to search the index just built.
Erik