Posted to dev@lucene.apache.org by Tod Thomas <tt...@chubb.com> on 2003/07/15 16:09:29 UTC

Parser Question

I noticed from the FAQ that the developer must provide a parser for every
type of document that requires indexing by Lucene.  Does Lucene have a
'plugin' capability to easily add a new parser into the mix?

Forgive me if this is a dumb question; I haven't yet looked at the source
code or the configuration in detail.


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: Parser Question

Posted by Peter Becker <pb...@dstc.edu.au>.

Tod Thomas wrote:

>Peter Becker wrote:
>
>  
>
>>Leo Galambos wrote:
>>
>>    
>>
>>>Peter Becker wrote:
>>>
>>>      
>>>
>>>>Hi Tod,
>>>>
>>>>as far as I know Lucene itself doesn't offer this (at least we failed
>>>>to find it). The closest thing available seems to be the Ant tasks.
>>>>
>>>>We are currently working on introducing this notion for our program,
>>>>which is open source. Besides the plugin mechanism there will be a
>>>>file filter mapping and a thread mechanism to maintain an index as
>>>>well as implementations using POI and Multivalent. Give us another
>>>>week or two.
>>>>        
>>>>
>>>Unfortunately, I didn't get this. Could you explain the mechanism,
>>>please? Thank you
>>>      
>>>
>>Not fully yet, since we are still working on it ;-) You can find the
>>code in our CVS repository on Sourceforge:
>>
>>
>>http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/
>>
>>The idea is that you have to supply different parsers for different
>>formats, then turn the results found into Lucene Document objects. At
>>the moment we do this using a normal interface similar to the one used
>>in the Java Ant tasks (see the "handlers" directory), but we want to
>>turn it into a plugin interface. Our tool should in the end do TXT, HTML
>>and XML out of the box and have at least three plugin implementations:
>>
>>  - POI for .doc, .xls
>>  - PDFbox for .pdf
>>  - Multivalent for .pdf, .dvi and others
>>
>>The plugin API will be extremely simple and it should fit easily with
>>the Ant tasks, so you should be able to wrap our code into an Ant task
>>or whatever interface you need.
>>    
>>
>
>This sounds really cool.  If I'm reading you correctly it will be a fairly intuitive
>exercise to port parsers written in Java for existing file formats to use your plugin
>architecture.  Accurate?
>  
>
The basic interface just takes a URL and returns a data object 
containing the different types of information: body, title, authors, and 
so on. The returned data object is where we differ from the Ant task, 
where a Lucene Document is created -- the main reason we did it 
differently is that we didn't want to decide how to index (mostly 
stored vs. unstored) in the parser code. It also gives us the 
option to use the parsers for something other than Lucene.
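
A minimal sketch of what such an interface might look like. All names here 
(DocumentHandler, DocumentSummary, PlainTextHandler) are hypothetical and 
not taken from the Docco code, and using the first line as the title is 
only an assumption for the example:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;

public class HandlerSketch {

    /** Format-neutral parse result; deliberately free of Lucene types. */
    public static final class DocumentSummary {
        public final String title;
        public final List<String> authors;
        public final String body;

        public DocumentSummary(String title, List<String> authors, String body) {
            this.title = title;
            this.authors = authors;
            this.body = body;
        }
    }

    /** Hypothetical parser contract: a URL goes in, extracted content comes out. */
    public interface DocumentHandler {
        DocumentSummary parse(URL url) throws IOException;
    }

    /** Trivial plain-text handler; treats the first line as the title (assumption). */
    public static final class PlainTextHandler implements DocumentHandler {
        @Override
        public DocumentSummary parse(URL url) throws IOException {
            StringBuilder body = new StringBuilder();
            String title = null;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (title == null) {
                        title = line;
                    }
                    body.append(line).append('\n');
                }
            }
            // Plain text carries no author metadata, so the list stays empty.
            return new DocumentSummary(title == null ? "" : title,
                    Collections.<String>emptyList(), body.toString());
        }
    }
}
```

The caller, not the handler, would then copy title, authors and body into 
a Lucene Document and choose per field whether it is stored or unstored.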

There will be some overhead for the plugin management, which we haven't 
done yet. Hopefully it will be very simple to use -- the idea here is to 
keep the complexity (if needed) in the plugin manager we will provide. 
We aim to get this done very soon.

More info when it is done :-) If you are too curious just read the code ;-)

  Peter



Re: Parser Question

Posted by Tod Thomas <tt...@chubb.com>.

Peter Becker wrote:

> Leo Galambos wrote:
>
> > Peter Becker wrote:
> >
> >> Hi Tod,
> >>
> >> as far as I know Lucene itself doesn't offer this (at least we failed
> >> to find it). The closest thing available seems to be the Ant tasks.
> >>
> >> We are currently working on introducing this notion for our program,
> >> which is open source. Besides the plugin mechanism there will be a
> >> file filter mapping and a thread mechanism to maintain an index as
> >> well as implementations using POI and Multivalent. Give us another
> >> week or two.
> >
> >
> > Unfortunately, I didn't get this. Could you explain the mechanism,
> > please? Thank you
>
> Not fully yet, since we are still working on it ;-) You can find the
> code in our CVS repository on Sourceforge:
>
>
> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/
>
> The idea is that you have to supply different parsers for different
> formats, then turn the results found into Lucene Document objects. At
> the moment we do this using a normal interface similar to the one used
> in the Java Ant tasks (see the "handlers" directory), but we want to
> turn it into a plugin interface. Our tool should in the end do TXT, HTML
> and XML out of the box and have at least three plugin implementations:
>
>   - POI for .doc, .xls
>   - PDFbox for .pdf
>   - Multivalent for .pdf, .dvi and others
>
> The plugin API will be extremely simple and it should fit easily with
> the Ant tasks, so you should be able to wrap our code into an Ant task
> or whatever interface you need.

This sounds really cool.  If I'm reading you correctly it will be a fairly intuitive
exercise to port parsers written in Java for existing file formats to use your plugin
architecture.  Accurate?

Tod



Re: Parser Question

Posted by Peter Becker <pb...@dstc.edu.au>.

Leo Galambos wrote:

> Peter Becker wrote:
>
>> Hi Tod,
>>
>> as far as I know Lucene itself doesn't offer this (at least we failed 
>> to find it). The closest thing available seems to be the Ant tasks.
>>
>> We are currently working on introducing this notion for our program, 
>> which is open source. Besides the plugin mechanism there will be a 
>> file filter mapping and a thread mechanism to maintain an index as 
>> well as implementations using POI and Multivalent. Give us another 
>> week or two.
>
>
> Unfortunately, I didn't get this. Could you explain the mechanism, 
> please? Thank you 

Not fully yet, since we are still working on it ;-) You can find the 
code in our CVS repository on Sourceforge:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/

The idea is that you have to supply different parsers for different 
formats, then turn the results found into Lucene Document objects. At 
the moment we do this using a normal interface similar to the one used 
in the Java Ant tasks (see the "handlers" directory), but we want to 
turn it into a plugin interface. Our tool should in the end do TXT, HTML 
and XML out of the box and have at least three plugin implementations:

  - POI for .doc, .xls
  - PDFbox for .pdf
  - Multivalent for .pdf, .dvi and others

The plugin API will be extremely simple and it should fit easily with 
the Ant tasks, so you should be able to wrap our code into an Ant task 
or whatever interface you need.
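
To illustrate the format-to-parser mapping, here is a rough sketch of a 
registry keyed by file extension. The class and method names are 
hypothetical (this is not the Docco plugin API), the handlers are 
stand-ins that only carry a name, and "last registration wins" is an 
assumption about how two plugins claiming .pdf might be resolved:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class HandlerRegistrySketch {

    /** Stand-in for a real parser plugin; only a name is modeled here. */
    public interface Handler {
        String name();
    }

    private final Map<String, Handler> byExtension = new HashMap<>();

    /** Registers a handler for an extension such as ".pdf"; the last registration wins. */
    public void register(String extension, Handler handler) {
        byExtension.put(normalize(extension), handler);
    }

    /** Returns the handler for the file name's extension, or null if none is registered. */
    public Handler lookup(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null; // no extension to dispatch on
        }
        return byExtension.get(normalize(fileName.substring(dot + 1)));
    }

    /** Strips a leading dot and lower-cases, so ".PDF" and "pdf" map to the same key. */
    private static String normalize(String extension) {
        String ext = extension.startsWith(".") ? extension.substring(1) : extension;
        return ext.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        HandlerRegistrySketch registry = new HandlerRegistrySketch();
        registry.register(".doc", () -> "POI");
        registry.register(".xls", () -> "POI");
        registry.register(".pdf", () -> "PDFbox");
        Handler handler = registry.lookup("Report.PDF");
        System.out.println(handler == null ? "no handler" : handler.name());
    }
}
```

A real plugin manager would also need some way to discover and load the 
handler classes; that part is left out here.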

  Peter


Re: Parser Question

Posted by Leo Galambos <Le...@seznam.cz>.

Peter Becker wrote:

> Hi Tod,
>
> as far as I know Lucene itself doesn't offer this (at least we failed 
> to find it). The closest thing available seems to be the Ant tasks.
>
> We are currently working on introducing this notion for our program, 
> which is open source. Besides the plugin mechanism there will be a file 
> filter mapping and a thread mechanism to maintain an index as well as 
> implementations using POI and Multivalent. Give us another week or two.

Unfortunately, I didn't get this. Could you explain the mechanism, 
please? Thank you

-g-



Re: Parser Question

Posted by Peter Becker <pb...@dstc.edu.au>.

Hi Tod,

as far as I know Lucene itself doesn't offer this (at least we failed to 
find it). The closest thing available seems to be the Ant tasks.

We are currently working on introducing this notion for our program, 
which is open source. Besides the plugin mechanism there will be a file 
filter mapping and a thread mechanism to maintain an index as well as 
implementations using POI and Multivalent. Give us another week or two.

BTW: has anyone looked into the option of using the OpenOffice UDK 
(http://udk.openoffice.org/) as a document parser? We wanted to, but I am 
afraid we won't have the time. It will surely be a huge plugin and not as 
easy to deploy as the average JAR, but it would support a large range of 
documents and should be well suited to enterprise document collections.

  Peter

Tod Thomas wrote:

>I noticed from the FAQ that the developer must provide a parser for every
>type of document that requires indexing by Lucene.  Does Lucene have a
>'plugin' capacity to easily add a new parser into the mix?
>
>Forgive me if this is a dumb question, I haven't yet looked at the source
>code, or the configuration in detail.
