Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2007/10/05 14:20:36 UTC

Parser roadmap

Hi,

As you've seen, I've been refactoring the Parser classes quite heavily
for the past few weeks, and now with TIKA-43 I'm reaching a milestone
that already resembles the proposed interface design.

Once TIKA-43 is committed (I'm giving it a day or two for reviews and
comments) there are still two Parser related changes that I'd like to
do before I think we're ready to do the first 0.1 release.

First, I'd like to replace the current Iterable<Content> construct
with a Metadata object that allows metadata to be passed in and out of
the parser. Also, this Metadata object should be decoupled from parser
configuration.

Second, instead of returning the text content of a document as a
String, I'd like the parsers to generate SAX events with the text
content passed as characters() events.

Unless anyone objects (feel free to do so if you have better design
ideas!), I'll follow up with new patches for these two issues in the
next week or two. Once these changes are done, I think we're good to
go for the first Tika release. That timing would also be perfect for
the upcoming ApacheCon US conference. :-)

BR,

Jukka Zitting

Re: Parser roadmap

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On 10/10/07, Keith R. Bennett <kb...@bbsinc.biz> wrote:
>
> I don't know if I officially have a vote yet,

everyone has a vote :-)

it's just that only some votes (PMC) are binding upon apache

- robert

Re: Parser roadmap

Posted by "Keith R. Bennett" <kb...@bbsinc.biz>.
I don't know if I officially have a vote yet, but I will continue the
unanimity and vote for Chris too!

- Keith


Rida Benjelloun wrote:
> 
> +1 for Chris as our Release Manager!
> Rida
> 
> 



Re: Parser roadmap

Posted by Rida Benjelloun <ri...@doculibre.com>.
+1 for Chris as our Release Manager!
Rida

2007/10/10, Bertrand Delacretaz <bd...@apache.org>:
>
> On 10/7/07, Jukka Zitting <ju...@gmail.com> wrote:
>
> > On 10/6/07, Chris Mattmann <ch...@jpl.nasa.gov> wrote:
> > > I'll put my name out there as someone available to be the release master
> > > when the time comes....
> >
> > +1!
>
> +1 for Chris as our Release Manager!
>
> -Bertrand
>

Re: Parser roadmap

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 10/7/07, Jukka Zitting <ju...@gmail.com> wrote:

> On 10/6/07, Chris Mattmann <ch...@jpl.nasa.gov> wrote:
> > I'll put my name out there as someone available to be the release master
> > when the time comes....
>
> +1!

+1 for Chris as our Release Manager!

-Bertrand

Re: Parser roadmap

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 10/7/07, Jukka Zitting <ju...@gmail.com> wrote:

> ...I'd rather go with:
>
>     void parse(InputStream stream, ContentHandler handler, Metadata metadata)
>         throws IOException, SAXException, TikaException;
>
> I.e. the parser invokes a series of callback methods on the given
> handler instance. This way the parse result never needs to be
> contained in a single object....

Sounds good to me!

-Bertrand

Re: Parser roadmap

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 10/10/07, Sami Siren <ss...@gmail.com> wrote:
> Does this mean Tika users need to implement a "parser" (ContentHandler)
> that can handle the events fired by a Tika Parser? One for each format? Or do
> we plan to normalize the events somehow?

The main rationale for outputting XML is to be able to express things
like "this is a heading", "this is a link", etc. so that for example a
search engine can put more weight on those parts of the content.

My preference would be to use XHTML Basic as the XML format that the
parsers will output. XHTML is widely known and supported, and is more
than expressive enough for our needs.
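
To make the idea concrete, here is a rough sketch of how a consumer could treat headings differently from body text when a parser emits XHTML-style SAX events. The class names (HeadingAwareHandler, XhtmlEventsDemo) and the sample text are made up for illustration and are not part of Tika; only the org.xml.sax types come from the JDK.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical consumer: keeps heading text apart from body text so that
// something like a search engine could weight the two differently.
class HeadingAwareHandler extends DefaultHandler {
    final StringBuilder headings = new StringBuilder();
    final StringBuilder body = new StringBuilder();
    private int headingDepth = 0;

    private static boolean isHeading(String localName) {
        return localName.matches("h[1-6]");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isHeading(localName)) headingDepth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isHeading(localName)) headingDepth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Route text to the heading or the body buffer depending on context.
        (headingDepth > 0 ? headings : body).append(ch, start, length);
    }
}

class XhtmlEventsDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Stands in for a parser: fires start/characters/end for one element.
    static void element(ContentHandler h, String name, String text) throws SAXException {
        h.startElement(XHTML, name, name, new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement(XHTML, name, name);
    }

    static HeadingAwareHandler run() {
        HeadingAwareHandler handler = new HeadingAwareHandler();
        try {
            handler.startDocument();
            element(handler, "h1", "Parser roadmap");
            element(handler, "p", "Two changes remain before 0.1.");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        HeadingAwareHandler result = run();
        System.out.println("headings: " + result.headings);
        System.out.println("body: " + result.body);
    }
}
```

Because every format would be normalized to the same XHTML element names, a handler like this would only need to be written once rather than once per document format.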

> Or is Tika going to provide those handlers for simple tasks like
> extracting title + content?

I would at least have utility adapters that convert the SAX events to
a character stream and further to a single string.
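
Such an adapter could look roughly like the following. This is only a sketch under my own assumptions: WriterAdapterHandler and TextAdapterDemo are hypothetical names, and the events are fired by hand here instead of by a real parser.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical adapter: forwards characters() events to a Writer and
// ignores all markup events.
class WriterAdapterHandler extends DefaultHandler {
    private final Writer writer;

    WriterAdapterHandler(Writer writer) {
        this.writer = writer;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            writer.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException("could not write text content", e);
        }
    }
}

class TextAdapterDemo {
    // Stands in for a parser firing events on the given handler.
    static void fireSampleEvents(ContentHandler handler) throws SAXException {
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        char[] text = "Hello, Tika".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
    }

    // SAX events -> character stream -> single string.
    static String extractText() {
        StringWriter out = new StringWriter();
        try {
            fireSampleEvents(new WriterAdapterHandler(out));
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText());
    }
}
```

A caller who only wants the plain text would pass a StringWriter, while a caller handling large documents could pass a Writer that streams to disk or to an indexer.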

BR,

Jukka Zitting

Re: Parser roadmap

Posted by Sami Siren <ss...@gmail.com>.
Jukka Zitting wrote:

> I'd rather go with:
> 
>     void parse(InputStream stream, ContentHandler handler, Metadata metadata)
>         throws IOException, SAXException, TikaException;
> 
> I.e. the parser invokes a series of callback methods on the given
> handler instance. This way the parse result never needs to be
> contained in a single object.

Does this mean Tika users need to implement a "parser" (ContentHandler)
that can handle the events fired by a Tika Parser? One for each format? Or do
we plan to normalize the events somehow?

Or is Tika going to provide those handlers for simple tasks like
extracting title + content?


-- 
 Sami Siren

Re: Parser roadmap

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 10/6/07, Chris Mattmann <ch...@jpl.nasa.gov> wrote:
> I'll put my name out there as someone available to be the release master
> when the time comes. I've done it on Nutch before and wouldn't mind doing it
> for Tika. Just let me know if you guys agree.

+1!

> > First, I'd like to replace the current Iterable<Content> construct
> > with a Metadata object that allows metadata to be passed in and out of
> > the parser. Also, this Metadata object should be decoupled from parser
> > configuration.
>
> I completely agree. I'd like to help with this issue as the Metadata
> framework is very near and dear to my heart. What does the interface that
> you are proposing look like again? Something like:
>
> String parse(InputStream stream, Metadata metadata)
>              throws IOException, TikaException;

Exactly.

> > Second, instead of returning the text content of a document as a
> > String, I'd like the parsers to generate SAX events with the text
> > content passed as characters() events.
>
> Then, the next evolutionary step would be:
>
> SAXEvent parse(InputStream stream, Metadata metadata)
>             throws IOException, TikaException;

I'd rather go with:

    void parse(InputStream stream, ContentHandler handler, Metadata metadata)
        throws IOException, SAXException, TikaException;

I.e. the parser invokes a series of callback methods on the given
handler instance. This way the parse result never needs to be
contained in a single object.
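
As a rough illustration of the callback style, here is a toy plain-text parser with the same shape of parse method. SketchMetadata and SketchParseException are hypothetical stand-ins for the proposed Metadata and TikaException classes, and the "Content-Type" key is just an example; none of this is the actual Tika code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the proposed Metadata object.
class SketchMetadata {
    private final Map<String, String> values = new HashMap<>();
    void set(String name, String value) { values.put(name, value); }
    String get(String name) { return values.get(name); }
}

// Hypothetical stand-in for TikaException.
class SketchParseException extends Exception {
    SketchParseException(String message) { super(message); }
}

// Toy parser: reports each input line as one characters() event, so the
// whole document never has to be held in memory at once.
class PlainTextSketchParser {
    void parse(InputStream stream, ContentHandler handler, SketchMetadata metadata)
            throws IOException, SAXException, SketchParseException {
        if (stream == null) {
            throw new SketchParseException("no input stream given");
        }
        metadata.set("Content-Type", "text/plain");
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        handler.startDocument();
        String line;
        while ((line = reader.readLine()) != null) {
            char[] chars = (line + "\n").toCharArray();
            handler.characters(chars, 0, chars.length);
        }
        handler.endDocument();
    }
}

// Consumer that only counts characters and keeps no text at all.
class CharCountingHandler extends DefaultHandler {
    long count;

    @Override
    public void characters(char[] ch, int start, int length) {
        count += length;
    }
}

class StreamingParseDemo {
    static long countChars(String text, SketchMetadata metadata) {
        CharCountingHandler handler = new CharCountingHandler();
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        try {
            new PlainTextSketchParser().parse(in, handler, metadata);
        } catch (IOException | SAXException | SketchParseException e) {
            throw new IllegalStateException(e);
        }
        return handler.count;
    }

    public static void main(String[] args) {
        SketchMetadata metadata = new SketchMetadata();
        System.out.println(countChars("Hello\nTika\n", metadata));
        System.out.println(metadata.get("Content-Type"));
    }
}
```

The counting handler shows the point of the design: the caller decides what to keep, and a consumer that needs only a statistic never builds the full string.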

BR,

Jukka Zitting

Re: Parser roadmap

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Rida,

[..snip..]
> However, the Metadata
> class should not be limited to a single metadata standard such as Dublin Core;
> I think the Metadata class should be extensible or generic enough to support
> multiple metadata standards.

The current Metadata class is extensible to support any metadata standard.
The existing interfaces that it implements are meant to be helper tools to
standardize the set of MetKeys when you actually want to use standard
metadata field names: however, it doesn't preclude the use of any Metadata
key field name that you'd like. In other words it supports both:

//example 1
Metadata m = new Metadata();
m.addMetadata(DC_TITLE, "Rida");

Just the same as it supports:

//example 2
Metadata m = new Metadata();
m.addMetadata("your_field_name_here", "Rida");

If it's determined that the set of "your_field_name_here" keys makes sense
and is in widespread use throughout the code, for convenience purposes, we
could create an interface:

public interface MyKeys{
 
  public static final String YOUR_KEY_1 = "my_key_1";

  //...
}

And then have the default Metadata class implement that interface:

public class Metadata implements DublinCore...,MyKeys{
 // rest of code
}

But this isn't a requirement, and should only be done where it makes sense
to. Just wanted to clarify that.
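
For anyone who wants to try the idea out in isolation, here is a self-contained sketch. KeyedMetadata, SketchDublinCore, and the "title" constant value are assumptions made for this example; they are not the actual Tika classes or key values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical key interface; the constant's value is an assumption.
interface SketchDublinCore {
    String DC_TITLE = "title";
}

// Hypothetical metadata container: any string works as a key, and the
// implemented interface merely supplies convenient standard names.
class KeyedMetadata implements SketchDublinCore {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    void addMetadata(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    List<String> getValues(String name) {
        return values.getOrDefault(name, Collections.emptyList());
    }
}

class KeyedMetadataDemo {
    static KeyedMetadata sample() {
        KeyedMetadata m = new KeyedMetadata();
        m.addMetadata(KeyedMetadata.DC_TITLE, "Rida");      // standard key
        m.addMetadata("your_field_name_here", "Rida");      // ad hoc key
        return m;
    }

    public static void main(String[] args) {
        KeyedMetadata m = sample();
        System.out.println(m.getValues(KeyedMetadata.DC_TITLE));
        System.out.println(m.getValues("your_field_name_here"));
    }
}
```

Both calls go through the same addMetadata method, so adding a new standard only means adding another interface of constants, not changing the container.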

Thanks!

Cheers,
  Chris


______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Parser roadmap

Posted by Rida Benjelloun <ri...@doculibre.com>.
Hi Jukka,
I totally agree with the parser roadmap. Thanks for the good work. I also
agree with replacing the Content class with a Metadata class. However, the
Metadata class should not be limited to a single metadata standard such as
Dublin Core; I think it should be extensible or generic enough to support
multiple metadata standards.

Regards.

On 10/5/07, Chris Mattmann <ch...@jpl.nasa.gov> wrote:
>
> Hi Jukka,
>
> > Once TIKA-43 is committed (I'm giving it a day or two for reviews and
> > comments) there are still two Parser related changes that I'd like to
> > do before I think we're ready to do the first 0.1 release.
>
> +1, agreed. At present, we've worked through 30 JIRA issues so far (great
> job guys!), and I think that the library is reaching stability and is primed
> for an official release.
>
> I'll put my name out there as someone available to be the release master
> when the time comes. I've done it on Nutch before and wouldn't mind doing it
> for Tika. Just let me know if you guys agree.
>
> >
> > First, I'd like to replace the current Iterable<Content> construct
> > with a Metadata object that allows metadata to be passed in and out of
> > the parser. Also, this Metadata object should be decoupled from parser
> > configuration.
>
> I completely agree. I'd like to help with this issue as the Metadata
> framework is very near and dear to my heart. What does the interface that
> you are proposing look like again? Something like:
>
> String parse(InputStream stream, Metadata metadata)
>              throws IOException, TikaException;
>
>
> >
> > Second, instead of returning the text content of a document as a
> > String, I'd like the parsers to generate SAX events with the text
> > content passed as characters() events.
>
> Then, the next evolutionary step would be:
>
> SAXEvent parse(InputStream stream, Metadata metadata)
>             throws IOException, TikaException;
>
> ?
>
> >
> > Unless anyone objects (feel free to do so if you have better design
> > ideas!), I'll follow up with new patches for these two issues in the
> > next week or two. Once these changes are done, I think we're good to
> > go for the first Tika release. Such a timing would also be perfect for
> > the upcoming ApacheCon US conference. :-)
>
> Totally agree! Great job so far: I am really starting to like this new
> Parsing interface...
>
> Cheers,
>   Chris
>
> >
> > BR,
> >
> > Jukka Zitting
>


-- 
---------------------------------------------------------
Rida Benjelloun
Doculibre inc.
ridabenjelloun@apache.org
rida.benjelloun@doculibre.com
Cel: 418-262-3222
Tel: 418-353-3390
Site Web : http://www.doculibre.com
---------------------------------------------------------

Re: Parser roadmap

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Jukka,

> Once TIKA-43 is committed (I'm giving it a day or two for reviews and
> comments) there are still two Parser related changes that I'd like to
> do before I think we're ready to do the first 0.1 release.

+1, agreed. At present, we've worked through 30 JIRA issues so far (great
job guys!), and I think that the library is reaching stability and is primed
for an official release.

I'll put my name out there as someone available to be the release master
when the time comes. I've done it on Nutch before and wouldn't mind doing it
for Tika. Just let me know if you guys agree.

> 
> First, I'd like to replace the current Iterable<Content> construct
> with a Metadata object that allows metadata to be passed in and out of
> the parser. Also, this Metadata object should be decoupled from parser
> configuration.

I completely agree. I'd like to help with this issue as the Metadata
framework is very near and dear to my heart. What does the interface that
you are proposing look like again? Something like:

String parse(InputStream stream, Metadata metadata)
             throws IOException, TikaException;


> 
> Second, instead of returning the text content of a document as a
> String, I'd like the parsers to generate SAX events with the text
> content passed as characters() events.

Then, the next evolutionary step would be:

SAXEvent parse(InputStream stream, Metadata metadata)
            throws IOException, TikaException;

?

> 
> Unless anyone objects (feel free to do so if you have better design
> ideas!), I'll follow up with new patches for these two issues in the
> next week or two. Once these changes are done, I think we're good to
> go for the first Tika release. Such a timing would also be perfect for
> the upcoming ApacheCon US conference. :-)

Totally agree! Great job so far: I am really starting to like this new
Parsing interface...

Cheers,
  Chris

> 
> BR,
> 
> Jukka Zitting
