Posted to user@manifoldcf.apache.org by Matthew Parker <mp...@apogeeintegration.com> on 2012/02/27 15:53:25 UTC

Transforming Manifold Metadata Prior to Pushing the Data into SOLR

I'm trying to push data into SOLR.

Is there a way to transform the metadata coming in from different data
sources like SharePoint, and the File Share, prior to posting it into SOLR?

For instance, documents have metadata specifying their file path. I need to
transform that to a URL I can use within SOLR to retrieve that document
through a servlet that I wrote.

Also, based on specific metadata that I'm seeing in the documents, I might
want to conditionally populate other fields in the SOLR index.

------------------------------
This e-mail and any files transmitted with it may be proprietary.  Please note that any views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of Apogee Integration.

Re: Transforming Manifold Metadata Prior to Pushing the Data into SOLR

Posted by Matthew Parker <mp...@apogeeintegration.com>.
Thanks for the insights Karl. I'll have to give this a little more thought.

On Mon, Feb 27, 2012 at 1:22 PM, Karl Wright <da...@gmail.com> wrote:

> If you've got a mix of data and only some of it comes through
> ManifoldCF, you can still use the ManifoldCF-generated URL for those
> documents that originate with ManifoldCF.  This should even work for
> documents from the JCIFS connector - even though the default URLs from
> this connector are "file:" style, there's a mapping you can set up for
> documents from that connector that maps to a URL format of your
> choice.  Similarly, most JDBC document URLs can readily be constructed
> as part of the database queries that you provide for the job.  So it
> does not sound like your servlet would have to do anything custom for
> any of the data that comes from ManifoldCF at this time, as long as
> you define your connections and jobs with some care as to the URLs
> they will produce.
>
> Thanks,
> Karl

Re: Transforming Manifold Metadata Prior to Pushing the Data into SOLR

Posted by Karl Wright <da...@gmail.com>.
If you've got a mix of data and only some of it comes through
ManifoldCF, you can still use the ManifoldCF-generated URL for those
documents that originate with ManifoldCF.  This should even work for
documents from the JCIFS connector - even though the default URLs from
this connector are "file:" style, there's a mapping you can set up for
documents from that connector that maps to a URL format of your
choice.  Similarly, most JDBC document URLs can readily be constructed
as part of the database queries that you provide for the job.  So it
does not sound like your servlet would have to do anything custom for
any of the data that comes from ManifoldCF at this time, as long as
you define your connections and jobs with some care as to the URLs
they will produce.
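
To make the idea of mapping a "file:" style URL onto a URL format of your choice concrete, here is a minimal sketch in Python. It is not the connector's own mapping syntax; the regular expression, the example input, and the servlet endpoint (SERVLET_BASE) are all assumptions made for illustration.

```python
import re
from urllib.parse import quote

# Assumed shape of an incoming "file:" URL and of the servlet endpoint;
# both are placeholders, not values taken from ManifoldCF.
FILE_URL = re.compile(r"^file:/+(?P<server>[^/]+)/(?P<path>.+)$")
SERVLET_BASE = "http://search.example.com/docs/fetch"


def to_servlet_url(file_url: str) -> str:
    """Rewrite a "file:" style URL into a URL served by a custom servlet."""
    match = FILE_URL.match(file_url)
    if match is None:
        # Anything that is not a "file:" URL is passed through unchanged.
        return file_url
    # Percent-encode both parts so spaces and similar characters survive.
    server = quote(match.group("server"), safe="")
    path = quote(match.group("path"), safe="/")
    return f"{SERVLET_BASE}?server={server}&path={path}"


print(to_servlet_url("file://fileserver/share/reports/Q1 summary.pdf"))
```

URLs that already use another scheme pass through untouched, so the same function could sit in front of a mix of file share and SharePoint documents.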

Thanks,
Karl


On Mon, Feb 27, 2012 at 11:25 AM, Matthew Parker
<mp...@apogeeintegration.com> wrote:
> Karl,
>
> I'm importing data from a number of sources, including SharePoint, file
> shares, and an Oracle database. The files/records are indexed by SOLR.
>
> Right now, some of the import is done through custom SOLR Data Import
> Handler facilities. I'm hoping to move away from that in the future.
>
> We are also aggregating some of the file share data into custom views on the
> web client. Lots of preprocessing.
>
> All of this is stored in the SOLR index with metadata describing how to
> display it within our custom web client. If the result is a certain type,
> we have custom templates that are displayed as a result.
>
> Manifold is a good solution for the SharePoint data. We don't really do any
> custom processing on it other than strip HTML from the text.
> It's the database and file share information that adds some challenges. I'm
> hoping to get SOLR out of the text processing pipeline, and just
> let it index data. We are moving to Pentaho at some point, and we'll
> probably handle most of the custom metadata processing there.
> At some point, we'll possibly integrate Pentaho as an output connection in
> Manifold.
>
> Thanks,
>
> Matt

Re: Transforming Manifold Metadata Prior to Pushing the Data into SOLR

Posted by Matthew Parker <mp...@apogeeintegration.com>.
Karl,

I'm importing data from a number of sources, including SharePoint, file
shares, and an Oracle database. The files/records are indexed by SOLR.

Right now, some of the import is done through custom SOLR Data Import
Handler facilities. I'm hoping to move away from that in the future.

We are also aggregating some of the file share data into custom views on
the web client. Lots of preprocessing.

All of this is stored in the SOLR index with metadata describing how to
display it within our custom web client. If the result is a certain type,
we have custom templates that are displayed as a result.

Manifold is a good solution for the SharePoint data. We don't really do any
custom processing on it other than strip HTML from the text.
It's the database and file share information that adds some challenges.
I'm hoping to get SOLR out of the text processing pipeline, and just
let it index data. We are moving to Pentaho at some point, and we'll
probably handle most of the custom metadata processing there.
At some point, we'll possibly integrate Pentaho as an output connection in
Manifold.

Thanks,

Matt

On Mon, Feb 27, 2012 at 10:04 AM, Karl Wright <da...@gmail.com> wrote:

> Please see my response interleaved below.
>
> On Mon, Feb 27, 2012 at 9:53 AM, Matthew Parker
> <mp...@apogeeintegration.com> wrote:
> > I'm trying to push data into SOLR.
> >
> > Is there a way to transform the metadata coming in from different data
> > sources like SharePoint, and the File Share, prior to posting it into
> > SOLR?
> >
>
> In general, ManifoldCF does not have data transformation abilities.
> With Solr, we rely on Solr Cell, which is a pipeline built on Tika, to
> extract content from documents and to perform transformations to
> document metadata etc.  At some point it may become possible to do
> more transformations in ManifoldCF in order to support search engines
> that don't have a pipeline, but that is currently not available.
>
> > For instance, documents have metadata specifying their file path. I need
> > to transform that to a URL I can use within SOLR to retrieve that document
> > through a servlet that I wrote.
> >
>
> The ManifoldCF model is that a connector creates a URL for each
> document that it indexes, using whatever makes sense for that
> particular repository to get you back to the document in question.
> So, for instance, Documentum documents will use URLs that point at
> Documentum's Webtop web application.
>
> It would be helpful to understand more precisely what you are trying
> to do.  You could, for instance, modify your servlet to redirect to
> the ManifoldCF-generated URL.  It gets indexed into Solr as the "id"
> field.
>
> > Also, based on specific metadata that I'm seeing in the documents, I
> > might want to conditionally populate other fields in the SOLR index.
> >
>
> That sounds like a job for the Tika pipeline to me.
>
> Thanks,
> Karl

Re: Transforming Manifold Metadata Prior to Pushing the Data into SOLR

Posted by Karl Wright <da...@gmail.com>.
Please see my response interleaved below.

On Mon, Feb 27, 2012 at 9:53 AM, Matthew Parker
<mp...@apogeeintegration.com> wrote:
> I'm trying to push data into SOLR.
>
> Is there a way to transform the metadata coming in from different data
> sources like SharePoint, and the File Share, prior to posting it into SOLR?
>

In general, ManifoldCF does not have data transformation abilities.
With Solr, we rely on Solr Cell, which is a pipeline built on Tika, to
extract content from documents and to perform transformations to
document metadata etc.  At some point it may become possible to do
more transformations in ManifoldCF in order to support search engines
that don't have a pipeline, but that is currently not available.

> For instance, documents have metadata specifying their file path. I need to
> transform that to a URL I can use within SOLR to retrieve that document
> through a servlet that I wrote.
>

The ManifoldCF model is that a connector creates a URL for each
document that it indexes, using whatever makes sense for that
particular repository to get you back to the document in question.
So, for instance, Documentum documents will use URLs that point at
Documentum's Webtop web application.

It would be helpful to understand more precisely what you are trying
to do.  You could, for instance, modify your servlet to redirect to
the ManifoldCF-generated URL.  It gets indexed into Solr as the "id"
field.
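
As a rough illustration of the redirect idea, here is a sketch written as a small Python WSGI application rather than a Java servlet. The KNOWN_IDS set stands in for a real Solr lookup, and the "id" query parameter and example URL are assumptions; checking the value against the index before redirecting avoids turning the endpoint into an open redirect.

```python
from urllib.parse import parse_qs

# Stand-in for a Solr query: the set of "id" values known to the index.
# A real deployment would look the value up in Solr instead (assumption).
KNOWN_IDS = {"http://sharepoint.example.com/sites/eng/spec.docx"}


def redirect_app(environ, start_response):
    """Redirect to the connector-generated URL stored in the "id" field."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    doc_id = params.get("id", [""])[0]
    if doc_id not in KNOWN_IDS:
        # Refuse to redirect to anything the index does not know about.
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"unknown document\n"]
    start_response("302 Found", [("Location", doc_id)])
    return [b""]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("", 8000, redirect_app) can serve the application.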

> Also, based on specific metadata that I'm seeing in the documents, I might
> want to conditionally populate other fields in the SOLR index.
>

That sounds like a job for the Tika pipeline to me.
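
Wherever that logic ends up running, the rules themselves tend to be simple. Here is a minimal sketch in Python of conditional field population; the field names (content_type, path, display_template, department) and both rules are hypothetical, and this is not Tika or Solr configuration.

```python
def enrich(metadata: dict) -> dict:
    """Return a copy of the metadata with conditionally derived fields added."""
    fields = dict(metadata)
    # Hypothetical rule 1: choose a display template from the content type.
    if fields.get("content_type") == "application/pdf":
        fields["display_template"] = "pdf_viewer"
    # Hypothetical rule 2: tag documents stored under a known folder.
    if fields.get("path", "").startswith("/finance/"):
        fields["department"] = "finance"
    return fields


print(enrich({"content_type": "application/pdf", "path": "/finance/q1.pdf"}))
```

Documents that match no rule come back unchanged, so the function is safe to apply to every record regardless of source.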

Thanks,
Karl
