
Posted to dev@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2003/04/08 16:55:20 UTC

Re: New xdoc for parsers

Thanks Jeff, I put all this in the Lucene FAQ at jGuru.

Otis

--- Jeff Linwood <je...@greenninja.com> wrote:
> Since this question comes up all the time on the users list, and the
> FAQ entry is umm...unhelpful :) I created an xdoc listing all the
> parsers I knew about.
> 
> Yes, I know this sort of duplicates the resources.xml xdoc, but this
> is more descriptive, and more clear from the menu as to what it is. I
> hope.
> 
> I also included a patch to the project.xml stylesheet if this 
> contribution gets accepted.
> 
> Thanks,
> Jeff Linwood
> <?xml version="1.0"?>
> <document>
>     <properties>
>     <author email="jeff@greenninja.com">Jeff Linwood</author>
>     <title>Parsers - Jakarta Lucene</title>
>     </properties>
>     <body>
> 
>     <section name="Introduction">
>         <p>
>             Lucene is capable of indexing any file format for documents,
>             but the application that uses the Lucene search engine is
>             responsible for translating these document types into a
>             format that Lucene can understand.  Several of these formats
>             can be indexed with open source or free solutions, and links
>             are given to the appropriate sites. Many Lucene users use
>             more than one of these in their applications.
>         </p>
>     </section>
>     
>     <section name="HTML">
>         <subsection name="JavaCC and IndexHTML">
>             An example that uses JavaCC to parse HTML into Lucene
>             Document objects is provided in the
>             <a href="demo3.html">Lucene web application demo</a> that
>             comes with the Lucene distribution.
>         </subsection>
>         <subsection name="NekoHTML">
>             The <a href="http://www.apache.org/~andyc/neko/doc/html/">CyberNeko
>             HTML Parser</a> lets you parse HTML documents. It's
>             relatively easy to remove most of the tags from an HTML
>             document (or all if you want), and then use the ones you left
>             in to help create metadata for your Lucene document. NekoHTML
>             also provides a DOM model for navigating through the HTML.
>         </subsection>
>         <subsection name="JTidy">
>             <a href="http://sourceforge.net/projects/jtidy/">JTidy</a>
>             cleans up HTML, and can provide a DOM interface to the HTML
>             files through a Java API.
>         </subsection>
>     </section>
>     
>     <section name="PDF">
>         <subsection name="PDFBox">
>             <a href="http://pdfbox.org/">PDFBox</a> is a Java API from Ben
>             Litchfield that will let you access the contents of a PDF
>             document. It comes with integration classes for Lucene to
>             translate a PDF into a Lucene document.
>         </subsection>
>         <subsection name="XPDF">
>             <a href="http://www.foolabs.com/xpdf/">XPDF</a> is an open
>             source tool that is licensed under the GPL. It's not a Java
>             tool, but there is a utility called pdftotext that can
>             translate PDF files into text files on most platforms from the
>             command line.
>         </subsection>
>         <subsection name="PDF to HTML">
>             Based on xpdf, there is a utility called
>             <a href="http://pdftohtml.sourceforge.net/">pdftohtml</a> that
>             can translate PDF files into HTML files. This is also not a
>             Java application.
>         </subsection>
>         <subsection name="JPedal">
>             <a href="http://www.jpedal.org/">JPedal</a> is a Java API for
>             extracting text and images from PDF documents.
>         </subsection>
>         <subsection name="TextMining.org">
>             <a href="http://www.textmining.org/">Simple Text Extractor
>             Library</a> for use with PDF documents. Relies on PDFBox.
>         </subsection>
>     </section>
> 
>     <section name="XML">
>         <subsection name="Lucene SAX/DOM indexing Demo">
>             <a href="http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/XML-Indexing-Demo/">XML
>             Demo</a>
>             This contribution is some sample code that demonstrates adding
>             simple XML documents into the index. It creates a new Document
>             object for each file, and then populates the Document with a
>             Field for each XML element, recursively. There are examples
>             included for both SAX and DOM.
>         </subsection>
>     </section>
>     
>     <section name="Word">
>         <subsection name="POI">
>             <a href="http://jakarta.apache.org/poi/">Jakarta Apache POI</a>
>             has an early development level Microsoft Word parser
>             for versions of Word from Office 97, 2000, and XP.
>         </subsection>
>         <subsection name="TextMining.org">
>             <a href="http://www.textmining.org/">Simple Text Extractor
>             Library</a> for use with Word documents. Relies on POI.
>         </subsection>
>     </section>
>     
>     <section name="Excel">
>         <subsection name="POI">
>             <a href="http://jakarta.apache.org/poi/">Jakarta Apache POI</a>
>             has an excellent Microsoft Excel parser for versions of Excel
>             from Office 97, 2000, and XP.  You can also modify Excel files
>             with this tool.
>         </subsection>
>     </section>
> 
>     <section name="RTF - Rich Text Format">
>         <subsection name="TetraSix MajiX">
>             <a href="http://www.tetrasix.com/">MajiX</a> is a translation
>             utility that will turn RTF (Rich Text Format) files into XML
>             files. These XML files could be indexed like any other XML
>             file, or you could write some custom code. See the XML section
>             of this page.
>         </subsection>
>     </section>
> 
> 
>     </body>
> </document>
> 21a22
> >         <item name="Parsers"           href="/parsers.html"/>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org



Re: New xdoc for parsers

Posted by Otis Gospodnetic <ot...@yahoo.com>.

> > > Any reason not to include it as an xdoc though as well?  The Lucene
> > > site is a little confusing to the newbie user who might just want to
> > > see if Lucene can match Inktomi, Index Server, whatever by supporting
> > > Microsoft or PDF formats.
> >
> > Parsers are not really a part of Lucene, so I thought FAQ entries would
> > be better.  If it proves insufficient I'll add them directly to the
> > site.
> >
> Hmm, maybe it's time to add a "Lucene Applications" section of the
> web site?
> I could work on that.

I think the Contributions page includes those already.  What Lucene
Applications do you have in mind?  Perhaps the Contributions page is not
well organized, or perhaps it is incomplete, or perhaps not up to date.
Maybe we just need to fix that page instead of creating a new one.

Otis



Re: New xdoc for parsers

Posted by Jeff Linwood <je...@greenninja.com>.

"Otis Gospodnetic" <ot...@yahoo.com>
>
> --- Jeff Linwood <je...@greenninja.com> wrote:
> > Cool.
> >
> > Any reason not to include it as an xdoc though as well?  The Lucene
> > site is a little confusing to the newbie user who might just want to
> > see if Lucene can match Inktomi, Index Server, whatever by supporting
> > Microsoft or PDF formats.
>
> Parsers are not really a part of Lucene, so I thought FAQ entries would
> be better.  If it proves insufficient I'll add them directly to the
> site.
>
Hmm, maybe it's time to add a "Lucene Applications" section of the web site?
I could work on that.

jeff



Re: New xdoc for parsers

Posted by Otis Gospodnetic <ot...@yahoo.com>.

--- Jeff Linwood <je...@greenninja.com> wrote:
> Cool.
> 
> Any reason not to include it as an xdoc though as well?  The Lucene
> site is a little confusing to the newbie user who might just want to
> see if Lucene can match Inktomi, Index Server, whatever by supporting
> Microsoft or PDF formats.

Parsers are not really a part of Lucene, so I thought FAQ entries would
be better.  If it proves insufficient I'll add them directly to the
site.

> It also needs a section on Indexing JSP files, since that gets asked
> a lot.

Yes, correct, I will add that soon.

Otis


> jeff

Re: New xdoc for parsers

Posted by Jeff Linwood <je...@greenninja.com>.

Cool.

Any reason not to include it as an xdoc though as well?  The Lucene site is
a little confusing to the newbie user who might just want to see if Lucene
can match Inktomi, Index Server, whatever by supporting Microsoft or PDF
formats.

It also needs a section on Indexing JSP files, since that gets asked a lot.

jeff