Posted to solr-user@lucene.apache.org by Raymond Wiker <rw...@gmail.com> on 2013/08/02 15:16:30 UTC

Re: DataImportHandler, BlobTransformer, FieldReaderDataSource and TikaEntityExtractor

It appears that this is simpler than I thought: in SOLR 4.4, at least,
there is a dataSource class named "FieldStreamDataSource" that I can use
directly with the TikaEntityProcessor. Given a blob column named DOCIMAGE,
I can use the following Tika entity:

  <dataSource type="FieldStreamDataSource" name="fieldstream"/>
  ...
      <entity name="tika" processor="TikaEntityProcessor"
              dataField="outer.DOCIMAGE" dataSource="fieldstream" format="xml">
        <!-- Do appropriate mapping here; meta="true" means it is a metadata field -->
        <field column="Author" meta="true" name="xmauthor"/>
        <field column="title" meta="true" name="title"/>
        <!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately -->
        <field column="text" name="content"/>
        <field column="content_type" name="content_type" meta="true"/>
        <field column="last_modified" name="last_modified" meta="true"/>
      </entity>

This gives me the document text, plus the extracted title and author, as expected.
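
For context, here is a minimal sketch of how the complete data-config.xml might fit together. Only the "fieldstream" data source, the "tika" entity and the DOCIMAGE column come from my setup as described above; the JDBC driver, URL, credentials, table name and ID column are placeholders and will differ in any real installation.

```xml
<dataConfig>
  <!-- JDBC source for the outer entity; driver, url and credentials are placeholders -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.example.Driver" url="jdbc:example://localhost/docs"
              user="solr" password="secret"/>
  <!-- Hands a binary column of the parent row to the nested entity as a stream -->
  <dataSource name="fieldstream" type="FieldStreamDataSource"/>
  <document>
    <!-- DOCUMENTS and ID are hypothetical; DOCIMAGE is the blob column -->
    <entity name="outer" dataSource="db"
            query="SELECT ID, DOCIMAGE FROM DOCUMENTS">
      <field column="ID" name="id"/>
      <!-- dataField refers to the blob column of the enclosing entity -->
      <entity name="tika" processor="TikaEntityProcessor"
              dataField="outer.DOCIMAGE" dataSource="fieldstream" format="xml">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The nested entity is evaluated once per row of the outer entity, so each database row yields one Solr document carrying both the database columns and the Tika output.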

What I haven't been able to do is extract content_type and
last_modified. Extracting last_modified may not be possible unless there
is an in-document property, but content_type should be detected by the
parser. My best guess is that the field is simply called something else,
although content_type (and last_modified) are the names used by
ExtractingRequestHandler.
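
One untested guess that may be worth trying: Tika's own metadata keys are spelled "Content-Type" and "Last-Modified", and ExtractingRequestHandler only produces the lowercase, underscored names when its lowernames parameter is set. Assuming TikaEntityProcessor looks up each meta="true" column under exactly the name given in the column attribute (an assumption, not something I have verified), the mapping would look like this sketch:

```xml
<entity name="tika" processor="TikaEntityProcessor"
        dataField="outer.DOCIMAGE" dataSource="fieldstream" format="xml">
  <!-- Assumption: column must match Tika's metadata key exactly -->
  <field column="Content-Type" meta="true" name="content_type"/>
  <!-- May stay empty unless the document itself carries this property -->
  <field column="Last-Modified" meta="true" name="last_modified"/>
</entity>
```

If this does not work either, the key names that Tika actually reports for a given document type would need to be checked directly.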




On Tue, Jul 30, 2013 at 9:49 AM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> There's no BlobTransformer in DataImportHandler. You'll have to write one.
> Also, you'd probably need to write a FieldInputStreamDataSource instead of
> FieldReaderDataSource.
>
>
> On Tue, Jul 30, 2013 at 12:30 PM, Raymond Wiker <rw...@gmail.com> wrote:
>
> > I have a case where I want to extract documents and metadata content from a
> > database. The metadata is not a problem, but it does not appear that I
> > can handle the document content (held as BLOBs in the database) with
> > out-of-the-box SOLR 4.4 functionality.
> >
> > I was hoping to be able to solve this by doing something like the
> > following:
> >
> > *DataImportHandler* extracts all the columns (fields), including the
> > document (BLOB)
> >
> > *BlobTransformer* to extract the BLOB content
> >
> > *FieldReaderDataSource* as a bridge between the extracted BLOB and Tika
> >
> > *TikaEntityExtractor* to extract the text and embedded metadata from the
> > BLOB.
> >
> > The first problem is that "BlobTransformer" does not appear to exist. It
> > could be that I need to load some additional jar files, or it could be
> > that the "BlobTransformer" functionality is simply not part of the Solr
> > distribution.
> >
> > Is there a way of handling this type of content using DataImportHandler,
> > or do I need to write an external connector for it?
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>