Posted to java-user@lucene.apache.org by Daniel Cortes <dc...@fib.upc.edu> on 2004/12/20 10:08:45 UTC
Number of documents
I have to show my boss whether Lucene is the best option for building the
search engine of a new portal.
I want to know: how many documents do you have in your index?
And how big is your DB?
The formats the portal has to support are html, jsp, txt, doc,
pdf, and ppt.
Another question I have:
I'm playing with the files from the book Lucene in Action, trying
the handling-types example. The data folder contains 5 files, and
the created index contains five
documents, but the only one that contains any words in the index is the
.html file.
Does everybody get the same result?
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Number of documents
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Dec 20, 2004, at 4:08 AM, Daniel Cortes wrote:
> I have to show my boss whether Lucene is the best option for building
> the search engine of a new portal.
> I want to know: how many documents do you have in your index?
> And how big is your DB?
I highly recommend you use Luke to examine the index. It is a great
tool to have handy. It shows these statistics and many others.
> The formats the portal has to support are html, jsp, txt,
> doc, pdf, and ppt.
HTML, TXT, DOC, and PDF are all quite straightforward to do. PPT is
possible, perhaps POI will do the trick. JSP depends on how you want
to analyze it. If any text in the file should be indexed (including
JSP directives, taglibs, and HTML) then you can treat it as a text
file. If you need to eliminate the tags then you'll need to parse the
JSP somehow. However, I strongly recommend that content not reside in
JSP pages but rather in a content management system, database, or such.
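To make the "treat it as a text file, or strip the tags" choice concrete, here is a minimal sketch of dispatching on file extension. It uses plain Java only and is not the Lucene in Action framework. The class name ExtensionDispatch, the two extractors, and the regex-based tag stripper are all assumptions made for illustration. A real JSP or HTML handler would need a proper parser.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class ExtensionDispatch {
    // Hypothetical extractor: index every character of the file, tags included.
    public static final UnaryOperator<String> PLAIN_TEXT = raw -> raw;

    // Hypothetical extractor: naive tag removal, for illustration only.
    public static final UnaryOperator<String> STRIP_TAGS =
        raw -> raw.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();

    private final Map<String, UnaryOperator<String>> handlers = new HashMap<>();

    // Associates a file extension (case-insensitive) with an extractor.
    public void register(String extension, UnaryOperator<String> handler) {
        handlers.put(extension.toLowerCase(Locale.ROOT), handler);
    }

    // Returns the extracted text, or null when no handler is registered.
    public String extract(String fileName, String rawContent) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        UnaryOperator<String> handler = handlers.get(ext);
        return handler == null ? null : handler.apply(rawContent);
    }

    public static void main(String[] args) {
        ExtensionDispatch dispatch = new ExtensionDispatch();
        dispatch.register("txt", PLAIN_TEXT);
        dispatch.register("jsp", STRIP_TAGS); // or PLAIN_TEXT to index directives too
        String jsp = "<%@ page language=\"java\" %><p>Hello</p>";
        System.out.println(dispatch.extract("page.jsp", jsp));
    }
}
```

Registering "jsp" with PLAIN_TEXT instead would index directives and taglib names along with the visible text, which is the first option described above.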
> Another question I have:
> I'm playing with the files from the book Lucene in Action, trying
> the handling-types example. The data folder contains 5 files,
> and the created index contains five
> documents, but the only one that contains any words in the index is the
> .html file.
> Does everybody get the same result?
Perhaps you are taking the output you see from "ant
ExtensionFileHandler" as an indication of what words were indexed.
This output, however, is showing Document.toString() which only shows
the text in stored fields. This particular example does not actually
index the documents - it shows the generalized handling framework and
the parsing of the files into a Lucene Document. Most of the file
handlers use unstored fields. The output I get is shown below. The
handlers have successfully extracted the text from the files. Maybe
you're referring to the FileIndexer example? We did not expose this
one to the Ant launcher. If FileIndexer is the code you're trying, let
me know what you've tried and how you're looking for the words that you
expect to see. Again, most of the fields are unstored (meaning the
original content is not stored in the index, only the terms extracted
through analysis).
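The stored/unstored distinction can be illustrated with a toy inverted index. This is a sketch in plain Java, not Lucene's API or its implementation. The class StoredVsUnstored and its methods are hypothetical, and the "analysis" is just lowercasing and splitting on non-word characters. It shows why an unstored field is still searchable even though its original text cannot be read back.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StoredVsUnstored {
    // "field:term" -> ids of documents containing that term (the searchable part).
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Per document: field name -> original text, kept only for stored fields.
    private final List<Map<String, String>> storedValues = new ArrayList<>();

    // Creates an empty document and returns its id.
    public int addDocument() {
        storedValues.add(new HashMap<>());
        return storedValues.size() - 1;
    }

    // Always indexes the terms; keeps the original text only when stored is true.
    public void addField(int docId, String field, String text, boolean stored) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
        if (stored) {
            storedValues.get(docId).put(field, text);
        }
    }

    // Returns the ids of documents whose field contains the term.
    public Set<Integer> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, Collections.emptySet());
    }

    // Returns the original text, or null if the field was not stored.
    public String storedText(int docId, String field) {
        return storedValues.get(docId).get(field);
    }

    public static void main(String[] args) {
        StoredVsUnstored index = new StoredVsUnstored();
        int doc = index.addDocument();
        index.addField(doc, "title", "Laptop power supplies", true);
        index.addField(doc, "body", "Code, Write, Fly", false);
        System.out.println(index.search("body", "fly"));   // the document is found
        System.out.println(index.storedText(doc, "body")); // but its text was not kept
    }
}
```

In this toy model, printing only the stored values of a document would show the title and nothing for the body, even though a search on body terms still matches.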
Erik
# to make the output cleaner for e-mailing I set ANT_ARGS like this:
% echo $ANT_ARGS
-logger org.apache.tools.ant.NoBannerLogger -emacs -Dnopause=true
% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/addressbook-entry.xml
Buildfile: build.xml
ExtensionFileHandler:
This example demonstrates the file extension document handler.
Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
all handled by the framework. The contents of the Lucene Document
built for the specified file is displayed.
skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
log4j:WARN No appenders could be found for logger (org.apache.commons.digester.Digester.sax).
log4j:WARN Please initialize the log4j system properly.
Document<Keyword<type:individual> Keyword<name:Zane Pasolini>
Keyword<address:999 W. Prince St.> Keyword<city:New York>
Keyword<province:NY> Keyword<postalcode:10013> Keyword<country:USA>
Keyword<telephone:+1 212 345 6789>>
% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/HTML.html
Buildfile: build.xml
ExtensionFileHandler:
This example demonstrates the file extension document handler.
Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
all handled by the framework. The contents of the Lucene Document
built for the specified file is displayed.
skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<Text<title:Laptop power supplies are available in First Class
only> Text<body:Code, Write, Fly This chapter is being written 11,000
meters above New Foundland.>>
% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/PlainText.txt
Buildfile: build.xml
ExtensionFileHandler:
This example demonstrates the file extension document handler.
Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
all handled by the framework. The contents of the Lucene Document
built for the specified file is displayed.
skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<UnStored<body>>
% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/PDF.pdf
Buildfile: build.xml
ExtensionFileHandler:
This example demonstrates the file extension document handler.
Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
all handled by the framework. The contents of the Lucene Document
built for the specified file is displayed.
skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.
Document<UnStored<body>>
% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/RTF.rtf
Buildfile: build.xml
ExtensionFileHandler:
This example demonstrates the file extension document handler.
Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
all handled by the framework. The contents of the Lucene Document
built for the specified file is displayed.
skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<UnStored<body>>
% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/MSWord.doc
Buildfile: build.xml
ExtensionFileHandler:
This example demonstrates the file extension document handler.
Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are
all handled by the framework. The contents of the Lucene Document
built for the specified file is displayed.
skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document<UnStored<body>>