
Posted to dev@tika.apache.org by kbennett <kb...@bbsinc.biz> on 2007/09/10 18:13:31 UTC

Re: Tika use cases

Hi, all.  I'm new here, so if I don't know what I'm talking about, feel free
to correct me. :)

It seems to me that options going into the parser are logically different
from metadata coming out of the parser, and that to maximize the code's
cohesion (see http://en.wikipedia.org/wiki/Cohesion_%28computer_science%29),
it would be preferable to express them as two different objects.

Also, if the metadata is the only output of the parser (as it appears to be
in the use case), why not have the parser create the metadata object itself,
and return it as the return value?  This would seem like a more natural
interface.

So, using this approach, the code would look something like this:

InputStream stream = ...;
ParseOptions parseOptions = ...;
SomeTikaInterface parser = new SomeTikaClass();
Metadata metadata = parser.extractMetadata(stream, parseOptions);...

... or, alternatively, the ParseOptions might be used to instantiate the
parser instead of being passed to the extractMetadata() method.
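
To make the two alternatives concrete, here is a minimal sketch that contrasts passing options per call with fixing them at construction time. Every type in it (ParseOptions, Metadata, PerCallParser, ConfiguredParser, SizeCounter) and the "bytes-read" key are hypothetical and exist only for illustration; none of them is an actual Tika class, and the byte counting merely stands in for real parsing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical types for illustration only; none of these are actual Tika classes.
final class ParseOptions {
    final int maxBytes;
    ParseOptions(int maxBytes) { this.maxBytes = maxBytes; }
}

final class Metadata {
    private final Map<String, String> values = new HashMap<>();
    void set(String key, String value) { values.put(key, value); }
    String get(String key) { return values.get(key); }
}

// Variant A: options are passed on every call.
final class PerCallParser {
    Metadata extractMetadata(InputStream stream, ParseOptions options) throws IOException {
        return SizeCounter.count(stream, options.maxBytes);
    }
}

// Variant B: options are fixed when the parser is constructed.
final class ConfiguredParser {
    private final ParseOptions options;
    ConfiguredParser(ParseOptions options) { this.options = options; }
    Metadata extractMetadata(InputStream stream) throws IOException {
        return SizeCounter.count(stream, options.maxBytes);
    }
}

// Stand-in for real parsing: counts bytes up to the configured limit.
final class SizeCounter {
    static Metadata count(InputStream stream, int maxBytes) throws IOException {
        int n = 0;
        while (n < maxBytes && stream.read() != -1) {
            n++;
        }
        Metadata metadata = new Metadata();
        metadata.set("bytes-read", Integer.toString(n));
        return metadata;
    }
}
```

In both variants the metadata object is created by the parser and returned, as suggested above; the only difference is where the options enter.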

- Keith

Jukka Zitting wrote:
> 
> Hi,
> 
> On 8/25/07, Bertrand Delacretaz <bd...@apache.org> wrote:
>> On 8/24/07, Jukka Zitting <ju...@gmail.com> wrote:
>> > ...Extract metadata:
>> >
>> >     InputStream stream = ...;
>> >     Metadata metadata = new Metadata();
>> >     SomeTikaInterface parser = new SomeTikaClass();
>> >     parser.extractMetadata(stream, metadata);...
>>
>> Maybe this (and extractContent() as well) need an additional
>> TikaParseOptions parameter that sets options just for this parsing
>> call?
> 
> Good point, though we could also pass all such options as a part of
> the metadata argument. If the options affect just this one document,
> then I would argue that those options might as well be a part of the
> document-specific metadata.
> 
> More generic options, like the XML parser options to use when parsing
> application/xml documents, should probably be handled as JavaBean
> properties of the instantiated parser objects.
> 
> BR,
> 
> Jukka Zitting
> 
> 

-- 
View this message in context: http://www.nabble.com/Tika-use-cases-tf4287938.html#a12596742
Sent from the Apache Tika - Development mailing list archive at Nabble.com.

Re: Tika use cases

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 9/10/07, kbennett <kb...@bbsinc.biz> wrote:
> Thanks for responding.  What you said made perfect sense.  My domain
> knowledge in this area is very limited, so I apologize in advance for that.

No need to apologize. I don't consider myself much of an expert
either, so feel free to dispute anything I say. :-)

> So a given parser (e.g. an MS Word document parser) might be instantiated at
> its first use with "global" options, that is, options for all parses, and
> then each call to extractMetadata would use that instance and be given
> file-specific options?  So it might look something like this?:

Exactly! An even more concrete example would be:

    // I want to extract metadata from a file I've been given
    File file = ...;

    // Construct a composite parser capable of parsing multiple document types
    CompositeParser composite = new CompositeParser();

    // Add support for MS Word documents
    WordParser word = new WordParser();
    word.setExtractDocumentProperties(false); // Nobody ever fills in these
    word.setExtractDeletedContent(true); // I want all the secrets!
    composite.addParser(word);

    // Fill in all the metadata we already know
    Metadata metadata = new Metadata();
    metadata.assert("filename", file.getName(), Confidence.CERTAIN);
    metadata.assert("content-length", file.length(), Confidence.CERTAIN);

    // Extract metadata from the given file
    InputStream stream = new FileInputStream(file);
    try {
        composite.extractMetadata(stream, metadata);
    } finally {
        stream.close();
    }

Note that we might well want to include some convenience code to
streamline common options, but the above reflects my understanding of
a truly generic mechanism.

Some specific notes on the above example code, especially on parts I
haven't discussed before:

1) The interfaces as currently envisioned should work seamlessly with
composition and decoration. I think "compatibility" with such patterns
is highly desirable.

2) I'd like to extend the current Metadata framework from Nutch with
support for multiple (potentially conflicting) sources of information
with various confidence levels. See the above code for an early
example. Support for things like the Shared MIME info database also
requires such "fuzzy" metadata.

3) After some consideration I think it's better if the parser
components consume but never close the given input streams. IMHO
(feel free to disagree) the responsibility for closing a stream should
always lie with whoever opened the stream in the first place.
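
The stream ownership rule in point 3 can be sketched roughly as follows. CountingParser, TrackingStream, and StreamOwnershipDemo are hypothetical names made up for this illustration, not Tika classes; TrackingStream exists only so that the example can observe whether close() was called.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical parser for illustration; it reads the stream to the end
// but never closes it.
final class CountingParser {
    long consume(InputStream stream) throws IOException {
        long count = 0;
        while (stream.read() != -1) {
            count++;
        }
        return count; // deliberately no stream.close() here
    }
}

// Helper that records whether close() was called.
final class TrackingStream extends ByteArrayInputStream {
    boolean closed = false;
    TrackingStream(byte[] data) { super(data); }
    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }
}

final class StreamOwnershipDemo {
    // Returns true if the stream was still open after the parser finished.
    static boolean parserLeavesStreamOpen(byte[] data) throws IOException {
        TrackingStream stream = new TrackingStream(data); // the caller opens...
        try {
            new CountingParser().consume(stream);
            return !stream.closed;
        } finally {
            stream.close(); // ...so the caller closes
        }
    }
}
```

One consequence of this split is that a caller can hand the same open stream to a decorator or wrapper without worrying that an inner parser will close it prematurely.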

BR,

Jukka Zitting

Re: Tika use cases

Posted by kbennett <kb...@bbsinc.biz>.

Jukka -

Thanks for responding.  What you said made perfect sense.  My domain
knowledge in this area is very limited, so I apologize in advance for that.

So a given parser (e.g. an MS Word document parser) might be instantiated at
its first use with "global" options, that is, options for all parses, and
then each call to extractMetadata would use that instance and be given
file-specific options?  So it might look something like this?:

// in some parser factory class, named, say ParserFactory:

private SomeTikaInterface msWordParser;

SomeTikaInterface getMSWordParser() {
    if (msWordParser == null) {
        msWordParser = new MSWordParser( /* the global config options */);
    }
    return msWordParser;
}

// ----------------- and then, where the actual parse needs to be done:

InputStream stream = ...; 
Metadata metadata = new Metadata(); 
myParserFactoryInstance.getMSWordParser().extractMetadata(stream, metadata);

?

- Keith 


Jukka Zitting wrote:
> 
> 
> There are really two kinds of options that could affect the way a
> parser would work. The first kind are generic options like the maximum
> amount of memory or time to use, the location of any temporary files
> to be used, etc. that don't have any direct relation to the specific
> document being parsed. The other kind are parsing hints related to the
> parsed document, like the name (and extension) of the file that
> contains the document, any MIME headers associated with the document
> (for example from an HTTP request or an email body part), etc.
> 


Re: Tika use cases

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 9/10/07, kbennett <kb...@bbsinc.biz> wrote:
> It seems to me that options going into the parser are logically different
> from metadata coming out of the parser, and that to maximize the code's
> cohesion (see http://en.wikipedia.org/wiki/Cohesion_%28computer_science%29),
> it would be preferable to express them as two different objects.

There are really two kinds of options that could affect the way a
parser would work. The first kind are generic options like the maximum
amount of memory or time to use, the location of any temporary files
to be used, etc. that don't have any direct relation to the specific
document being parsed. The other kind are parsing hints related to the
parsed document, like the name (and extension) of the file that
contains the document, any MIME headers associated with the document
(for example from an HTTP request or an email body part), etc.

The first kind of options I'd really handle separately as JavaBean
properties or some such of the parser instances, but the second kind
is actually more or less accurate metadata about the document in
question, so IMHO it would make perfect sense to pass that information
as a part of the metadata argument.

> Also, if the metadata is the only output of the parser (as it appears to be
> in the use case), why not have the parser create the metadata object itself,
> and return it as the return value?  This would seem like a more natural
> interface.

As mentioned above, I think the metadata object could (and should) be
used to pass various parsing "hints" to the parser, and that the
parser can then extend, verify, or correct the given metadata. This
approach also allows one to have a sequence of parsers that
incrementally extract more and more information from the input
document.
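
As a rough sketch of this idea, the following shows a caller-supplied hint being read by one parser and a weaker guess being corrected by a later one. All names (Confidence, Metadata.claim, MetadataParser, ExtensionParser, MagicBytesParser, ParserChain) are hypothetical and are not the Nutch or Tika API. The rule that a strictly more confident claim replaces a weaker one is just one possible policy, and a byte array is used instead of an InputStream only so that several parsers can inspect the same document without stream rewinding.

```java
import java.util.HashMap;
import java.util.Map;

// All names below are hypothetical; this is not the Nutch or Tika Metadata API.
enum Confidence { GUESS, LIKELY, CERTAIN }

final class Metadata {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Confidence> confidences = new HashMap<>();

    // Records a claim, keeping the existing value unless the new claim is
    // strictly more confident.
    void claim(String key, String value, Confidence confidence) {
        Confidence current = confidences.get(key);
        if (current == null || confidence.compareTo(current) > 0) {
            values.put(key, value);
            confidences.put(key, confidence);
        }
    }

    String get(String key) { return values.get(key); }
}

interface MetadataParser {
    void extractMetadata(byte[] document, Metadata metadata);
}

// Guesses the type from the file name hint supplied by the caller.
final class ExtensionParser implements MetadataParser {
    public void extractMetadata(byte[] document, Metadata metadata) {
        String name = metadata.get("filename");
        if (name != null && name.endsWith(".txt")) {
            metadata.claim("content-type", "text/plain", Confidence.GUESS);
        }
    }
}

// Inspects the leading bytes and can correct an earlier, weaker guess.
final class MagicBytesParser implements MetadataParser {
    public void extractMetadata(byte[] document, Metadata metadata) {
        if (document.length >= 4 && document[0] == '%' && document[1] == 'P'
                && document[2] == 'D' && document[3] == 'F') {
            metadata.claim("content-type", "application/pdf", Confidence.LIKELY);
        }
    }
}

// Runs parsers in order; each one sees what the previous ones recorded.
final class ParserChain implements MetadataParser {
    private final MetadataParser[] parsers;
    ParserChain(MetadataParser... parsers) { this.parsers = parsers; }
    public void extractMetadata(byte[] document, Metadata metadata) {
        for (MetadataParser parser : parsers) {
            parser.extractMetadata(document, metadata);
        }
    }
}
```

With a file named report.txt whose content starts with %PDF, the extension-based guess of text/plain is replaced by the more confident application/pdf, while the caller's CERTAIN filename is left untouched.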

BR,

Jukka Zitting