Posted to java-user@lucene.apache.org by Hetan Shah <He...@Sun.COM> on 2005/01/06 00:31:27 UTC

Indexing flat files with out .txt extension

Hello,

How can one index simple text files without the .txt extension? I am 
trying to use IndexFiles and IndexHTML, but not to my satisfaction. 
With IndexFiles I do not get any control over the content of the file, 
and with IndexHTML the files without any extension do not get indexed 
at all. Any pointers are really appreciated.

Thanks.
-H


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Span Query Performance

Posted by Paul Elschot <pa...@xs4all.nl>.

Sorry for the duplicate on lucene-dev, it should have gone to lucene-user 
directly:

A bit more:

On Thursday 06 January 2005 10:22, Paul Elschot wrote:
> On Thursday 06 January 2005 02:17, Andrew Cunningham wrote:
> > Hi all,
> > 
> > I'm currently doing a query similar to the following:
> > 
> > for w in wordset:
> >     query = w near (word1 V word2 V word3 ... V word1422);
> >     perform query
> > 
> > and I am doing this through SpanQuery.getSpans(), iterating through the 
> > spans and counting
> > the matches, which can result in 4782282 matches (essentially I am only 
> > after the match count).
> > The query works but the performance can be somewhat slow; so I am 
> > wondering:
> > 
...
> > c) Is there a faster method than what I am doing that I should consider?
> 
> Preindexing all word combinations that you're interested in.
> 

In case you know all the words in advance, you could also index a
helper word at the same position as each of those words.
This requires a custom analyzer that inserts the helper word in the
token stream with a zero position increment.
The query then simplifies to:
query = w near helperword
which would probably speed things up significantly.
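
A minimal sketch of the helper-word idea, written in plain Java rather than against Lucene's analysis classes. The names Tok, HELPER, injectHelper and countNear are hypothetical and exist only for this illustration; a real implementation would be a custom analyzer emitting the helper token with a position increment of zero. The distance rule in countNear is a simplification of what a span-near query does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of a token stream; NOT Lucene's Token or TokenFilter API.
class HelperWordSketch {

    // A term plus its position increment relative to the previous token.
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    // Hypothetical helper term; choose one that cannot occur in real text.
    static final String HELPER = "_helper_";

    // Split on single spaces; every token advances the position by one.
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        for (String s : text.split(" ")) out.add(new Tok(s, 1));
        return out;
    }

    // After each target word, emit HELPER with increment 0 so that the
    // helper occupies the same position as the word it follows.
    static List<Tok> injectHelper(List<Tok> in, Set<String> targets) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            out.add(t);
            if (targets.contains(t.term)) out.add(new Tok(HELPER, 0));
        }
        return out;
    }

    // Absolute positions of a term: running sum of increments, first token at 0.
    static List<Integer> positions(List<Tok> stream, String term) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Tok t : stream) {
            pos += t.posInc;
            if (t.term.equals(term)) result.add(pos);
        }
        return result;
    }

    // Count pairs (w, other) whose positions differ by 1..maxDistance.
    // Distance 0 is skipped so that w is not matched against its own helper.
    static int countNear(List<Tok> stream, String w, String other, int maxDistance) {
        int count = 0;
        for (int p : positions(stream, w)) {
            for (int q : positions(stream, other)) {
                int d = Math.abs(p - q);
                if (d >= 1 && d <= maxDistance) count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Set<String> targets = new HashSet<>(Arrays.asList("apple", "pear"));
        List<Tok> stream = injectHelper(tokenize("red apple and green pear"), targets);
        // One lookup against the helper replaces one lookup per target word.
        System.out.println(countNear(stream, "green", HELPER, 2));
    }
}
```

The point of the design is that the large OR over all known words is paid once at indexing time, so the query only has to walk the positions of two terms.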

Regards,
Paul Elschot


Re: Span Query Performance

Posted by Paul Elschot <pa...@xs4all.nl>.

On Thursday 06 January 2005 02:17, Andrew Cunningham wrote:
> Hi all,
> 
> I'm currently doing a query similar to the following:
> 
> for w in wordset:
>     query = w near (word1 V word2 V word3 ... V word1422);
>     perform query
> 
> and I am doing this through SpanQuery.getSpans(), iterating through the 
> spans and counting
> the matches, which can result in 4782282 matches (essentially I am only 
> after the match count).
> The query works but the performance can be somewhat slow; so I am wondering:
> 
> a) Would the query potentially run faster if I used 
> Searcher.search(query) with a custom similarity,
> or do both methods essentially use the same mechanics?

It would be somewhat slower, because search() loops over the same getSpans()
and, on top of that, computes document scores and constructs a Hits from them.

> b) Does using a RAMDirectory improve query performance by any significant 
> amount?

That depends on your operating system, the size of the index, the amount
of RAM you can use, the file buffering efficiency, other loads on the 
computer ...
 
> c) Is there a faster method than what I am doing that I should consider?

Preindexing all word combinations that you're interested in.

Regards,
Paul Elschot
 



Span Query Performance

Posted by Andrew Cunningham <cu...@csiro.au>.

Hi all,

I'm currently doing a query similar to the following:

for w in wordset:
    query = w near (word1 V word2 V word3 ... V word1422);
    perform query

and I am doing this through SpanQuery.getSpans(), iterating through the 
spans and counting
the matches, which can result in 4782282 matches (essentially I am only 
after the match count).
The query works but the performance can be somewhat slow; so I am wondering:

a) Would the query potentially run faster if I used 
Searcher.search(query) with a custom similarity,
or do both methods essentially use the same mechanics?

b) Does using a RAMDirectory improve query performance by any significant 
amount?

c) Is there a faster method than what I am doing that I should consider?

Thanks,
Andrew


Re: Indexing flat files with out .txt extension

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jan 11, 2005, at 7:28 PM, Hetan Shah wrote:
> Thanks for the pointers. I have modified Indexer.java to index the
> files from the directory by removing the file extension check for
> (".txt"). Now I do get the index from the files.

...

>
> java org.apache.lucene.demo.SearchFiles

The problem is you're using the SearchFiles demo code, which uses 
different field names than Indexer.java.  You need to be sure the 
searching and indexing code agree on the field names.  Since you 
borrowed Indexer.java from LIA, keep borrowing and use its Searcher.java.  
You can run "ant Searcher" from the LIA source code.
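
To illustrate the mismatch, here is a rough sketch in plain Java, with a Map standing in for a stored document. It is not Lucene's Document API, and the field names "filename" and "path" are assumptions made for the example; check the actual names in your own indexing and searching code.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for index/search code; a Map plays the role of a stored document.
class FieldNameSketch {

    // Assumed name used by the indexing side (illustrative only).
    static final String INDEXED_FIELD = "filename";

    // "Index" a file by storing its path under the indexing side's field name.
    static Map<String, String> indexFile(String path) {
        Map<String, String> doc = new HashMap<>();
        doc.put(INDEXED_FIELD, path);
        return doc;
    }

    // Display routine that reads one fixed field name and falls back to a
    // message when that field is absent from the document.
    static String describe(Map<String, String> doc, String fieldName) {
        String value = doc.get(fieldName);
        return value != null ? value : "No path nor URL for this document";
    }

    public static void main(String[] args) {
        Map<String, String> doc = indexFile("/home/user/notes");
        System.out.println(describe(doc, "path"));         // searcher expects another name
        System.out.println(describe(doc, INDEXED_FIELD));  // both sides agree
    }
}
```

Keeping the field names in one shared constant, as INDEXED_FIELD does here, is one simple way to stop the two sides drifting apart.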

Be sure to really learn what's going on in that code rather than just 
accepting what it's doing - this will pay off as you continue to evolve 
your application.  Indexer.java has only 6 (effective) lines of code 
tied to Lucene's API, and similarly very few lines of Lucene-dependent 
code in Searcher.java.  All of this is demo code, and is designed to be 
adapted to your needs.

	Erik


Re: Indexing flat files with out .txt extension

Posted by Hetan Shah <He...@Sun.COM>.

Hi Erik,

Thanks for the pointers. I have modified Indexer.java to index the
files from the directory by removing the file extension check for
(".txt"). Now I do get the index from the files.

The new situation is that when I run SearchFiles:

java org.apache.lucene.demo.SearchFiles
Query: tty
Searching for: tty
3 total matching documents
0. No path nor URL for this document
1. No path nor URL for this document
2. No path nor URL for this document

I do not get the actual path from the index, and using Luke I get the
three hits. The last two are from the index and not the real documents.

Any idea what is happening and how I can fix it?

Thanks.
-H

Erik Hatcher wrote:
> On Jan 10, 2005, at 7:06 PM, Hetan Shah wrote:
> 
>>Got the latest Ant and got the demo to work. I am, however, not sure 
>>which part of the source code does the indexing for different file 
>>types, for example .html, .txt and such.
> 
> 
> Your best bet is to dig around in the codebase.  The Indexer.java code 
> is hard-coded to only do .txt file extensions - this was on purpose as 
> the first example in the book, figuring it would be relatively safe and 
> fast for someone to run this code on their C:\ drive.
> 
> There is also an example easily run from the Ant launcher to show how 
> various document types can be handled using an extensible framework.  
> Run "ant ExtensionFileHandler".  It doesn't actually index the document 
> it creates, but displays it to the console.  It would be pretty trivial 
> to pair the Indexer.java code up with the file handler framework to 
> crawl a directory tree and index any content it recognizes.
> 
> 
>>Appreciate your help. If you have any sample code would certainly 
>>appreciate that also.
> 
> 
> You got all the code already.  It should be fairly straightforward to 
> navigate the src tree, especially with the Table of Contents handy:
> 
> 	http://www.lucenebook.com/toc
> 
> (incidentally, this dynamic TOC page is blending the blog content with 
> the TOC using an IndexReader to find all blog entries that refer to 
> each section - and you'll see the two, minor and cosmetic, errata 
> listed there already).
> 
> 	Erik
> 
> 
> 




Re: Indexing flat files with out .txt extension

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jan 10, 2005, at 7:06 PM, Hetan Shah wrote:
> Got the latest Ant and got the demo to work. I am, however, not sure 
> which part of the source code does the indexing for different file 
> types, for example .html, .txt and such.

Your best bet is to dig around in the codebase.  The Indexer.java code 
is hard-coded to only do .txt file extensions - this was on purpose as 
the first example in the book, figuring it would be relatively safe and 
fast for someone to run this code on their C:\ drive.

There is also an example easily run from the Ant launcher to show how 
various document types can be handled using an extensible framework.  
Run "ant ExtensionFileHandler".  It doesn't actually index the document 
it creates, but displays it to the console.  It would be pretty trivial 
to pair the Indexer.java code up with the file handler framework to 
crawl a directory tree and index any content it recognizes.
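
As a rough sketch of the directory-crawling half of that idea, the plain-Java class below walks a tree and keeps files that either end in .txt or have no extension at all. The class and method names are hypothetical, the acceptance rule is only one possible choice, and the step that hands each file to Lucene is deliberately left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Selects candidate files for indexing; the indexing call itself is omitted.
class ExtensionlessFileWalker {

    // Accept names ending in .txt, names with no dot ("README"),
    // and dotfiles whose only dot is the leading one (".profile").
    static boolean accept(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        boolean noExtension = dot <= 0;
        return noExtension || name.toLowerCase(Locale.ROOT).endsWith(".txt");
    }

    // Walk the tree under root and return the accepted regular files, sorted.
    static List<Path> collect(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(ExtensionlessFileWalker::accept)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : collect(root)) {
            System.out.println(p);  // a real indexer would add the file to the index here
        }
    }
}
```

Note that a missing extension says nothing about whether a file is actually text, so a real indexer may also want to sniff the content before reading it.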

> Appreciate your help. If you have any sample code would certainly 
> appreciate that also.

You got all the code already.  It should be fairly straightforward to 
navigate the src tree, especially with the Table of Contents handy:

	http://www.lucenebook.com/toc

(incidentally, this dynamic TOC page is blending the blog content with 
the TOC using an IndexReader to find all blog entries that refer to 
each section - and you'll see the two, minor and cosmetic, errata 
listed there already).

	Erik


Re: Indexing flat files with out .txt extension

Posted by Hetan Shah <He...@Sun.COM>.

Hi Erik,
Got the latest Ant and got the demo to work. I am, however, not sure which 
part of the source code does the indexing for different file types, for 
example .html, .txt and such. From there I can work out how to index a 
plain text file that does not have any extension.

Appreciate your help. If you have any sample code would certainly 
appreciate that also.
-H.

Erik Hatcher wrote:

> On Jan 6, 2005, at 6:49 PM, Hetan Shah wrote:
>
>> Hi Erik,
>>
>> I got the source downloaded and unpacked. I am having difficulty in 
>> building any of the modules. Maybe something's wrong with my Ant 
>> installation.
>> ************************
>> LuceneInAction% ant test
>> Buildfile: build.xml
>>
>> BUILD FAILED
>> file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element 
>> "available"
>
>
> The good ol' README says this:
>
> R E Q U I R E M E N T S
> -----------------------
>   * JDK 1.4+
>   * Ant 1.6+ (to run the automated examples)
>   * JUnit 3.8.1+
>     - junit.jar should be in ANT_HOME/lib
>
> You are not running Ant 1.6, I'm sure.  Upgrade your version of Ant, 
> and of course follow the rest of the README and all should be well.
>
>     Erik
>
>
>




Re: Indexing flat files with out .txt extension

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jan 6, 2005, at 6:49 PM, Hetan Shah wrote:
> Hi Erik,
>
> I got the source downloaded and unpacked. I am having difficulty in 
> building any of the modules. Maybe something's wrong with my Ant 
> installation.
> ************************
> LuceneInAction% ant test
> Buildfile: build.xml
>
> BUILD FAILED
> file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element 
> "available"

The good ol' README says this:

R E Q U I R E M E N T S
-----------------------
   * JDK 1.4+
   * Ant 1.6+ (to run the automated examples)
   * JUnit 3.8.1+
     - junit.jar should be in ANT_HOME/lib

You are not running Ant 1.6, I'm sure.  Upgrade your version of Ant, 
and of course follow the rest of the README and all should be well.

	Erik



Re: Indexing flat files with out .txt extension

Posted by Hetan Shah <He...@Sun.COM>.

Hi Erik,

I got the source downloaded and unpacked. I am having difficulty in 
building any of the modules. Maybe something's wrong with my Ant 
installation.
************************
LuceneInAction% ant test
Buildfile: build.xml

BUILD FAILED
file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element 
"available"

Total time: 5 seconds

LuceneInAction% ant Indexer
Buildfile: build.xml

BUILD FAILED
file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element 
"available"

Total time: 5 seconds
**********************
Can you point me to the proper module for creating my own indexer? I tried 
looking into the indexing module but was not sure.

TIA,
-H

Erik Hatcher wrote:

>
> On Jan 5, 2005, at 6:31 PM, Hetan Shah wrote:
>
>> How can one index simple text files without the .txt extension? I am 
>> trying to use IndexFiles and IndexHTML, but not to my 
>> satisfaction. With IndexFiles I do not get any control over the 
>> content of the file, and with IndexHTML the files without any 
>> extension do not get indexed at all. Any pointers are really 
>> appreciated.
>
>
> Try out the Indexer code from Lucene in Action.  You can download it 
> from the link here: 
> http://www.lucenebook.com/blog/announcements/sourcecode.html
>
> It'll be cleaner to follow and borrow from.  The code that ships with 
> Lucene is for demonstration purposes.  It surprises me how often folks 
> use that code to build real indexes.  It's quite straightforward to 
> create your own Java code to do the indexing in whatever manner you 
> like, borrowing from examples.
>
> When you get the download unpacked, simply run "ant Indexer" to see it 
> in action.  And then "ant Searcher" to search the index just built.
>
>     Erik
>
>




Re: Indexing flat files with out .txt extension

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jan 5, 2005, at 6:31 PM, Hetan Shah wrote:
> How can one index simple text files without the .txt extension? I am 
> trying to use IndexFiles and IndexHTML, but not to my satisfaction. 
> With IndexFiles I do not get any control over the content of the 
> file, and with IndexHTML the files without any extension do not 
> get indexed at all. Any pointers are really appreciated.

Try out the Indexer code from Lucene in Action.  You can download it 
from the link here: 
http://www.lucenebook.com/blog/announcements/sourcecode.html

It'll be cleaner to follow and borrow from.  The code that ships with 
Lucene is for demonstration purposes.  It surprises me how often folks 
use that code to build real indexes.  It's quite straightforward to 
create your own Java code to do the indexing in whatever manner you 
like, borrowing from examples.

When you get the download unpacked, simply run "ant Indexer" to see it 
in action.  And then "ant Searcher" to search the index just built.

	Erik
