Posted to dev@tika.apache.org by "Keith R. Bennett" <kb...@bbsinc.biz> on 2007/10/17 23:56:41 UTC

Fulltext Metadata Property?

All -

Do we want to remove the functionality in ParserPostProcessor that
accumulates the full text into the "fulltext" property in the Metadata?

If I understand correctly, one of the reasons for the ContentHandler
architecture was to support not reading all content into memory.  Doesn't
this read the entire parsed content into memory?  And, if the wrapped parser
does the same, and the external parser implementation (e.g. Poi) does the
same, then the maximum document size we can support becomes much smaller?

- Keith

-- 
View this message in context: http://www.nabble.com/Fulltext-Metadata-Property--tf4643633.html#a13263876
Sent from the Apache Tika - Development mailing list archive at Nabble.com.


Re: Fulltext Metadata Property?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 10/23/07, Keith R. Bennett <kb...@bbsinc.biz> wrote:
> The ParserPostProcessor creates a TeeHandler that sends output to the
> caller's handler and, in addition, its own WriteOutContentHandler.  So, if I
> understand correctly, if the caller's handler is also using a
> WriteOutContentHandler or equivalent, then the full text is being saved in
> two StringWriters, no?

Sure, but why would the caller use a WriteOutContentHandler, if she
already uses ParserPostProcessor and can get the fulltext string from
metadata.get("fulltext")?

BR,

Jukka Zitting

Re: Fulltext Metadata Property?

Posted by "Keith R. Bennett" <kb...@bbsinc.biz>.
Jukka -

The ParserPostProcessor creates a TeeHandler that sends output to the
caller's handler and, in addition, its own WriteOutContentHandler.  So, if I
understand correctly, if the caller's handler is also using a
WriteOutContentHandler or equivalent, then the full text is being saved in
two StringWriters, no?
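The double-buffering concern can be illustrated with a minimal sketch. WriterSink and TeeSketch are hypothetical stand-ins written against the JDK's SAX API only; they are not Tika's actual WriteOutContentHandler or TeeHandler, and the real classes may differ in detail.

```java
import java.io.StringWriter;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a handler that writes character events to a StringWriter.
class WriterSink extends DefaultHandler {
    private final StringWriter writer = new StringWriter();

    @Override
    public void characters(char[] ch, int start, int length) {
        writer.write(ch, start, length);
    }

    String text() {
        return writer.toString();
    }
}

// Hypothetical stand-in for a tee: forwards each character event to two handlers.
class TeeSketch extends DefaultHandler {
    private final ContentHandler first;
    private final ContentHandler second;

    TeeSketch(ContentHandler first, ContentHandler second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        first.characters(ch, start, length);
        second.characters(ch, start, length);
    }
}

class TeeDemo {
    // Sends the same text through a tee into two separate sinks and returns both.
    static WriterSink[] run(String text) {
        WriterSink callersSink = new WriterSink();   // what the caller might supply
        WriterSink internalSink = new WriterSink();  // what the wrapper adds on its own
        TeeSketch tee = new TeeSketch(callersSink, internalSink);
        char[] chars = text.toCharArray();
        try {
            tee.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return new WriterSink[] { callersSink, internalSink };
    }

    public static void main(String[] args) {
        WriterSink[] sinks = run("hello tika");
        // Each sink holds its own complete copy of the text.
        System.out.println(sinks[0].text().length() + sinks[1].text().length());
    }
}
```

In this sketch every character is buffered twice, once per sink, which is the situation the question describes when the caller also accumulates the text.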

Regards,
Keith







Re: Fulltext Metadata Property?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 10/22/07, Keith R. Bennett <kb...@bbsinc.biz> wrote:
> I was thinking that as a ContentHandler, the user could choose to place all
> the data in memory, and there would be a single copy of the full text.
>
> As the ParserPostProcessor, if I understand correctly, the user is bound to
> consume the extra memory if using the AutoDetectParser, and we are probably
> consuming twice as much memory to do so, since we would be saving the full
> text in two different string writers.

I don't quite follow you. AutoDetectParser never reads the full
content into memory (of course unless an underlying parser does it).

> So I was thinking of moving the existing logic from the ParserPostProcessor
> to a ContentHandler implementation.

Sure, why not.

If I understand you correctly, you'd prefer something like this:

    Parser parser = ...;
    Metadata metadata = new Metadata();
    parser.parse(..., new FullTextContentHandler(metadata), metadata);

over:

    Parser parser = new ParserPostProcessor(...);
    Metadata metadata = new Metadata();
    parser.parse(..., new DefaultHandler(), metadata);
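A minimal sketch of what the first variant's handler might look like, assuming it simply accumulates character events and stores the result when the document ends. FullTextHandlerSketch is a hypothetical name, and a plain Map stands in for Tika's Metadata class so that the example depends only on the JDK.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects all character data and publishes it under "fulltext".
class FullTextHandlerSketch extends DefaultHandler {
    private final Map<String, String> metadata;  // stand-in for Tika's Metadata
    private final StringBuilder text = new StringBuilder();

    FullTextHandlerSketch(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        // toString() copies the buffer, so the full text is briefly held twice here.
        metadata.put("fulltext", text.toString());
    }
}

class FullTextDemo {
    // Simulates a parser emitting the given chunks as SAX character events.
    static Map<String, String> extract(String... chunks) {
        Map<String, String> metadata = new HashMap<>();
        FullTextHandlerSketch handler = new FullTextHandlerSketch(metadata);
        for (String chunk : chunks) {
            char[] chars = chunk.toCharArray();
            handler.characters(chars, 0, chars.length);
        }
        handler.endDocument();
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(extract("Hello, ", "Tika").get("fulltext"));  // prints "Hello, Tika"
    }
}
```

With this shape, only callers who pass such a handler pay the memory cost of holding the full text; callers who pass a streaming handler do not.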

BR,

Jukka Zitting

Re: Fulltext Metadata Property?

Posted by "Keith R. Bennett" <kb...@bbsinc.biz>.
Jukka -

I was thinking that as a ContentHandler, the user could choose to place all
the data in memory, and there would be a single copy of the full text.

As the ParserPostProcessor, if I understand correctly, the user is bound to
consume the extra memory if using the AutoDetectParser, and we are probably
consuming twice as much memory to do so, since we would be saving the full
text in two different string writers.

So I was thinking of moving the existing logic from the ParserPostProcessor
to a ContentHandler implementation.

- Keith







Re: Fulltext Metadata Property?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 10/22/07, Keith R. Bennett <kb...@bbsinc.biz> wrote:
> > The summary and outLinks implementation based on SAX events may be
> > more complex but it's still doable, so I'd rather focus on making that
> > work.
>
> That's not something I'd feel confident in implementing correctly, so I
> won't offer to do that.  If you'd like me to implement a simpler, temporary
> solution, feel free to let me know.

What's wrong with the current code in ParserPostProcessor? The reason
why I objected to just removing the class is that it already
implements the functionality that you're asking for.

I'm fine if you want to refactor the class into something else, but I
don't see the logic of first removing it and then implementing the
same functionality from scratch.

BR,

Jukka Zitting

Re: Fulltext Metadata Property?

Posted by "Keith R. Bennett" <kb...@bbsinc.biz>.
Jukka -

> The summary and outLinks implementation based on SAX events may be
> more complex but it's still doable, so I'd rather focus on making that
> work.

That's not something I'd feel confident in implementing correctly, so I
won't offer to do that.  If you'd like me to implement a simpler, temporary
solution, feel free to let me know.

Regards,
- Keith



Re: Fulltext Metadata Property?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 10/19/07, Keith R. Bennett <kb...@bbsinc.biz> wrote:
> The summary and outlinks information may need the text from multiple SAX
> events, so their implementation may not be trivial, unless we accumulate all
> parsed text in a single string, and then inspect that string (as you did in
> ParserPostProcessor).
>
> Therefore, since fulltext, summary, and outlinks all benefit from all text
> being in a single string, why not create a single implementation of
> ContentHandler that populates all of them?  Then the full text string would
> be in only one place in Tika.

The summary and outLinks implementation based on SAX events may be
more complex but it's still doable, so I'd rather focus on making that
work. The more places we have in Tika that read the content into a
single string, the harder it will be to support really large
documents.
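As a rough illustration of a summary built from SAX events without holding the whole document, here is a minimal sketch. SummaryHandlerSketch and its maxChars parameter are hypothetical; the thread does not say how the summary is actually defined, so "the first maxChars characters" is an assumption made for the example.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps only the first maxChars characters of the text.
class SummaryHandlerSketch extends DefaultHandler {
    private final int maxChars;
    private final StringBuilder summary = new StringBuilder();

    SummaryHandlerSketch(int maxChars) {
        this.maxChars = maxChars;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        int room = maxChars - summary.length();
        if (room > 0) {
            // Append only as much of this event as still fits; drop the rest.
            summary.append(ch, start, Math.min(room, length));
        }
    }

    String summary() {
        return summary.toString();
    }
}

class SummaryDemo {
    // Simulates a parser emitting the given chunks as SAX character events.
    static String summarize(int maxChars, String... chunks) {
        SummaryHandlerSketch handler = new SummaryHandlerSketch(maxChars);
        for (String chunk : chunks) {
            char[] chars = chunk.toCharArray();
            handler.characters(chars, 0, chars.length);
        }
        return handler.summary();
    }

    public static void main(String[] args) {
        System.out.println(summarize(8, "abcdef", "ghij"));  // prints "abcdefgh"
    }
}
```

Memory use in this sketch is bounded by maxChars regardless of document size, which is the property the SAX-based approach is meant to preserve.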

BR,

Jukka Zitting

Re: Fulltext Metadata Property?

Posted by "Keith R. Bennett" <kb...@bbsinc.biz>.
Jukka -

I committed a temporary fix that disabled the use of the
ParserPostProcessor, but that removed the functionality altogether, so if
you like we can discuss how to restore this functionality as an option.

The summary and outlinks information may need the text from multiple SAX
events, so their implementation may not be trivial, unless we accumulate all
parsed text in a single string, and then inspect that string (as you did in
ParserPostProcessor).

Therefore, since fulltext, summary, and outlinks all benefit from all text
being in a single string, why not create a single implementation of
ContentHandler that populates all of them?  Then the full text string would
be in only one place in Tika.

This could also shield the user from some complexity -- this handler would
create the StringWriter itself.  Also, memory would be saved because the
same string would be returned by both Metadata.get("fulltext") and
XyzContentHandler.getFullText().

If this idea sounds good, what would you suggest naming this handler? 
FulltextContentHandler?  DefaultContentHandler?  Something else?

- Keith





Re: Fulltext Metadata Property?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 10/18/07, Keith R. Bennett <kb...@bbsinc.biz> wrote:
> After removing those things, the ParserPostProcessor doesn't do anything.
> Do you want to remove it altogether?  We could also just not instantiate it
> -- in TikaConfig, we would add the parser implementation without wrapping it
> in a ParserPostProcessor.

I'd be OK replacing it with SummaryContentHandler and
OutLinksContentHandler, i.e. ContentHandler classes that would extract
the summary text and any matched URIs from the text content. This way
we'd still have all the functionality in Tika.
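A minimal sketch of how link extraction could work on streamed character events, assuming a link is any whitespace-delimited token starting with "http://" or "https://". OutLinksHandlerSketch is a hypothetical name and the matching rule is a simplification chosen for the example, not the pattern ParserPostProcessor uses.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects URI-like tokens, even when one is split across events.
class OutLinksHandlerSketch extends DefaultHandler {
    private final List<String> links = new ArrayList<>();
    private final StringBuilder token = new StringBuilder();  // current partial token

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            if (Character.isWhitespace(ch[i])) {
                flushToken();
            } else {
                token.append(ch[i]);  // carried over to the next event if unfinished
            }
        }
    }

    @Override
    public void endDocument() {
        flushToken();  // the last token may not be followed by whitespace
    }

    private void flushToken() {
        String candidate = token.toString();
        if (candidate.startsWith("http://") || candidate.startsWith("https://")) {
            links.add(candidate);
        }
        token.setLength(0);
    }

    List<String> links() {
        return links;
    }
}

class OutLinksDemo {
    // Simulates a parser emitting the given chunks as SAX character events.
    static List<String> extract(String... chunks) {
        OutLinksHandlerSketch handler = new OutLinksHandlerSketch();
        for (String chunk : chunks) {
            char[] chars = chunk.toCharArray();
            handler.characters(chars, 0, chars.length);
        }
        handler.endDocument();
        return handler.links();
    }

    public static void main(String[] args) {
        // The first URI is split across two events on purpose.
        System.out.println(extract("see http://exam", "ple.org/a for", " more https://x.y"));
    }
}
```

Only the current token is buffered, so memory stays small as long as the text contains whitespace; a production version would likely also cap the token length and treat element boundaries as separators.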

BR,

Jukka Zitting

Re: Fulltext Metadata Property?

Posted by "Keith R. Bennett" <kb...@bbsinc.biz>.
Jukka -

After removing those things, the ParserPostProcessor doesn't do anything. 
Do you want to remove it altogether?  We could also just not instantiate it
-- in TikaConfig, we would add the parser implementation without wrapping it
in a ParserPostProcessor.

- Keith







Re: Fulltext Metadata Property?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 10/18/07, Keith R. Bennett <kb...@bbsinc.biz> wrote:
> Do we want to remove the functionality in ParserPostProcessor that
> accumulates the full text into the "fulltext" property in the Metadata?

Definitely! I left it there while refactoring the Parser interface to
minimize functional changes, but I think we should remove it now.

The same goes for the summary and outLinks properties. We might
perhaps add those features as optional Parser decorators.

BR,

Jukka Zitting