Posted to java-user@lucene.apache.org by Daniel Cortes <dc...@fib.upc.edu> on 2004/12/20 10:08:45 UTC

Number of documents

I have to show my boss whether Lucene is the best option for building the 
search engine of a new portal.
How many documents do you have in your index?
And how big is your DB?
The portal has to support these formats: html, jsp, txt, doc, pdf, ppt.

Another question:
I'm playing with the files from the book Lucene in Action, trying the 
handling-types example. The data folder contains 5 files, and the index 
that gets created contains five documents, but the only one that contains 
any words in the index is the .html file.
Does everybody get the same result?


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Number of documents

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Dec 20, 2004, at 4:08 AM, Daniel Cortes wrote:
> I have to show my boss whether Lucene is the best option for building 
> the search engine of a new portal.
> How many documents do you have in your index?
> And how big is your DB?

I highly recommend you use Luke to examine the index.  It is a great 
tool to have handy.  It shows these statistics and many others.

> The portal has to support these formats: html, jsp, txt, doc, pdf, ppt.

HTML, TXT, DOC, and PDF are all quite straightforward to do.  PPT is 
possible; perhaps POI will do the trick.  JSP depends on how you want 
to analyze it.  If all the text in the file should be indexed (including 
JSP directives, taglibs, and HTML) then you can treat it as a text 
file.  If you need to eliminate the tags then you'll need to parse the 
JSP somehow.  However, I strongly recommend that content not reside in 
JSP pages but rather in a content management system, database, or such.
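
To make the two JSP options concrete, here is a minimal sketch in plain Java of choosing a text extractor by file extension.  Everything in it (ExtensionDispatch, TextExtractor, stripTags, withDefaults) is a hypothetical name made up for illustration; it is not the Lucene in Action framework and it does not touch the Lucene API.  The regex tag stripper is only a stand-in for a real HTML or JSP parser, and binary formats such as DOC, PDF, and PPT would need a real parsing library behind the same interface.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: these names are illustrative, not a Lucene or book API.
class ExtensionDispatch {

    /** Turns raw file content into plain text that can be handed to an analyzer. */
    interface TextExtractor {
        String extract(String rawContent);
    }

    private final Map<String, TextExtractor> extractors = new HashMap<>();

    /** Registers an extractor for an extension given without the dot, e.g. "txt". */
    void register(String extension, TextExtractor extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Picks the extractor by extension and runs it; fails loudly for unknown types. */
    String extract(String fileName, String rawContent) {
        TextExtractor extractor = extractors.get(extensionOf(fileName));
        if (extractor == null) {
            throw new IllegalArgumentException("No handler for: " + fileName);
        }
        return extractor.extract(rawContent);
    }

    /** Crude tag stripper for illustration only; a real parser is more robust. */
    static String stripTags(String markup) {
        return markup.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Example wiring: plain text passes through, HTML and JSP lose their tags. */
    static ExtensionDispatch withDefaults() {
        ExtensionDispatch dispatch = new ExtensionDispatch();
        dispatch.register("txt", raw -> raw);
        dispatch.register("html", ExtensionDispatch::stripTags);
        // To index directives and taglibs as well, register "jsp" with raw -> raw.
        dispatch.register("jsp", ExtensionDispatch::stripTags);
        return dispatch;
    }
}
```

With this wiring, ExtensionDispatch.withDefaults().extract("page.jsp", content) returns the page text with directives and tags removed, while a .txt file comes back unchanged.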

> Another question:
> I'm playing with the files from the book Lucene in Action, trying the 
> handling-types example. The data folder contains 5 files, and the index 
> that gets created contains five documents, but the only one that 
> contains any words in the index is the .html file.
> Does everybody get the same result?

Perhaps you are taking the output you see from "ant 
ExtensionFileHandler" as an indication of what words were indexed.  
This output, however, is showing Document.toString() which only shows 
the text in stored fields.  This particular example does not actually 
index the documents - it shows the generalized handling framework and 
the parsing of the files into a Lucene Document.  Most of the file 
handlers use unstored fields.  The output I get is shown below.  The 
handlers have successfully extracted the text from the files.

Maybe you're referring to the FileIndexer example?  We did not expose 
this one to the Ant launcher.  If FileIndexer is the code you're trying, 
let me know what you've tried and how you're looking for the words that 
you expect to see.  Again, most of the fields are unstored (meaning the 
original content is not stored in the index, only the terms extracted 
through analysis).
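
The stored versus unstored distinction can be illustrated with a toy model in plain Java.  This is a sketch under stated assumptions, not Lucene code: StoredVsUnstored, ToyField, add, search, and describe are hypothetical names, the "analysis" is just lower-casing and splitting on non-word characters, and the printed form only imitates the Document<...> style of output.  The point it shows is that an unstored field can still be searchable even though its text never appears when the document is printed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: this is not Lucene and does not use the Lucene API.
class StoredVsUnstored {

    /** One field of a document: a name, its text, and whether the text is kept. */
    static final class ToyField {
        final String name;
        final String text;
        final boolean stored;

        ToyField(String name, String text, boolean stored) {
            this.name = name;
            this.text = text;
            this.stored = stored;
        }
    }

    // "field:word" -> ids of the documents containing that term
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    // per document: the printable view, which holds stored values only
    private final List<String> printable = new ArrayList<>();

    /** Adds a document and returns its id. */
    int add(ToyField... fields) {
        int docId = printable.size();
        StringBuilder view = new StringBuilder("Document<");
        for (int i = 0; i < fields.length; i++) {
            ToyField f = fields[i];
            // Every field is broken into terms, whether it is stored or not.
            for (String word : f.text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    postings.computeIfAbsent(f.name + ":" + word, k -> new TreeSet<>()).add(docId);
                }
            }
            // Only stored fields keep their original text for display.
            if (i > 0) {
                view.append(' ');
            }
            view.append(f.stored ? "Text<" + f.name + ":" + f.text + ">" : "UnStored<" + f.name + ">");
        }
        printable.add(view.append('>').toString());
        return docId;
    }

    /** Returns the ids of documents whose given field contains the word. */
    Set<Integer> search(String field, String word) {
        return postings.getOrDefault(field + ":" + word.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    /** Returns the printable view of a document (stored text only). */
    String describe(int docId) {
        return printable.get(docId);
    }
}
```

In this model, adding a single unstored "body" field containing "Code, Write, Fly" gives a document whose describe() result is Document<UnStored<body>>, yet search("body", "fly") still finds it.  An empty-looking printout is therefore not evidence that no words were extracted.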

	Erik

# to make the output cleaner for e-mailing I set ANT_ARGS like this:
% echo $ANT_ARGS
-logger org.apache.tools.ant.NoBannerLogger -emacs -Dnopause=true

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/addressbook-entry.xml
Buildfile: build.xml

ExtensionFileHandler:

       This example demonstrates the file extension document handler.
       Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
       all handled by the framework.  The contents of the Lucene Document
       built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
log4j:WARN No appenders could be found for logger (org.apache.commons.digester.Digester.sax).
log4j:WARN Please initialize the log4j system properly.
Document<Keyword<type:individual> Keyword<name:Zane Pasolini> Keyword<address:999 W. Prince St.> Keyword<city:New York> Keyword<province:NY> Keyword<postalcode:10013> Keyword<country:USA> Keyword<telephone:+1 212 345 6789>>

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/HTML.html
Buildfile: build.xml

ExtensionFileHandler:

       This example demonstrates the file extension document handler.
       Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
       all handled by the framework.  The contents of the Lucene Document
       built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<Text<title:Laptop power supplies are available in First Class only> Text<body:Code, Write, Fly This chapter is being written 11,000 meters above New Foundland.>>

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/PlainText.txt
Buildfile: build.xml

ExtensionFileHandler:

       This example demonstrates the file extension document handler.
       Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
       all handled by the framework.  The contents of the Lucene Document
       built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<UnStored<body>>

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/PDF.pdf
Buildfile: build.xml

ExtensionFileHandler:

       This example demonstrates the file extension document handler.
       Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
       all handled by the framework.  The contents of the Lucene Document
       built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.
Document<UnStored<body>>

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/RTF.rtf
Buildfile: build.xml

ExtensionFileHandler:

       This example demonstrates the file extension document handler.
       Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
       all handled by the framework.  The contents of the Lucene Document
       built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<UnStored<body>>

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/MSWord.doc
Buildfile: build.xml

ExtensionFileHandler:

       This example demonstrates the file extension document handler.
       Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
       all handled by the framework.  The contents of the Lucene Document
       built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<UnStored<body>>
