
Posted to user@manifoldcf.apache.org by Kadri Atalay <at...@gmail.com> on 2011/04/21 00:18:06 UTC

How to Extract Document Content into Solr Using Manifold

I'm able to use ManifoldCF and the SharedDrive connector to index files into
Solr. However, the only fields I see in Solr are Author, Content_type, Name,
and last_modified.

Can anyone tell me how to index the content into Solr as well?

Thanks in advance!

Kadri

PS. I'm using SolrCell (Tika), and a manual update/extract works fine.

Re: How to Extract Document Content into Solr Using Manifold

Posted by Kadri Atalay <at...@gmail.com>.

OK, thanks.

On Wed, Apr 20, 2011 at 6:36 PM, Karl Wright <da...@gmail.com> wrote:

> The content is posted to the update request handler.  It might be
> helpful if you turn on some logging in Solr to see exactly what is
> happening there.
>
> Karl

Re: How to Extract Document Content into Solr Using Manifold

Posted by Karl Wright <da...@gmail.com>.

I tried out your suggestions here on a freshly installed Solr 3.1
instance.  Some observations:

(1) The /extract/tika handler does not exist out of the box; the
/update/extract handler still exists, though.
(2) For the /update/extract handler, it did not seem necessary to pass
fmap.content=attr_content as an argument.

So it looks like, for a simple setup, the Solr output connector's
default values work just fine.  (I had a lot of trouble with Derby
queries running a long time, but that's a different issue.)
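
One possible reading of observation (2), offered as an assumption rather
than something verified in this thread: if the handler's defaults in
solrconfig.xml already map the extracted content to a field, a
request-level fmap.content is only needed to send it somewhere else. The
sketch below is a simplified model of request parameters overriding
handler defaults. The function name and the default values are
hypothetical, so check your own solrconfig.xml.

```python
def resolve_target_field(source_field, handler_defaults, request_params):
    """Return the Solr field an extracted field would be written to.

    Simplified model: request parameters take precedence over handler
    defaults, and a field without an fmap.* entry keeps its own name.
    """
    merged = dict(handler_defaults)
    merged.update(request_params)  # request-level values win
    return merged.get("fmap." + source_field, source_field)


if __name__ == "__main__":
    # Hypothetical handler defaults; not copied from a real solrconfig.xml.
    defaults = {"fmap.content": "text"}
    print(resolve_target_field("content", defaults, {}))
    print(resolve_target_field("content", defaults,
                               {"fmap.content": "attr_content"}))
```

Under these assumed defaults, the content lands in one field when no
mapping is passed and in attr_content only when the request asks for it.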

Karl

On Thu, Apr 21, 2011 at 10:30 AM, Kadri Atalay <at...@gmail.com> wrote:
> Sure Karl, no problem.
>
> My initial assumption was that when Solr is set up to use Tika (Solr Cell),
> content would be automatically extracted and indexed in Solr.
> But it looks like a field mapping needs to be defined in the ManifoldCF job.
>
> The goal of the project I'm working on is to:
>
> 1. Use Solr with Tika (to extract and index multiple document formats),
> 2. Use ManifoldCF (to use Active Directory security to pull user information
> from a domain controller and store the ACL for each indexed document),
> 3. Perform secure searches on all the indexed documents based on the
> logged-in user's credentials.
>
> One caveat here is that the file system I'm using is not a plain vanilla
> FS. It's StorHouse / RFS from FileTek.
>
> So, as I move along, I'll post my findings and ask for suggestions.
>
> I already got your book, and can't wait to read the connector creation
> chapters!
>
> Thanks,
>
> Kadri

Re: How to Extract Document Content into Solr Using Manifold

Posted by Kadri Atalay <at...@gmail.com>.

Sure Karl, no problem.

My initial assumption was that when Solr is set up to use Tika (Solr Cell),
content would be automatically extracted and indexed in Solr.
But it looks like a field mapping needs to be defined in the ManifoldCF job.

The goal of the project I'm working on is to:

1. Use Solr with Tika (to extract and index multiple document formats),
2. Use ManifoldCF (to use Active Directory security to pull user information
from a domain controller and store the ACL for each indexed document),
3. Perform secure searches on all the indexed documents based on the
logged-in user's credentials.
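
As a rough illustration of step 3, a search front end could turn the
user's access tokens into a Solr filter query. This is only a sketch
under stated assumptions: the field names allow_token_document and
deny_token_document, the helper name, and the sample token are
placeholders, and real deployments also need rules for documents that
carry no ACL at all.

```python
def build_acl_filter(tokens, allow_field="allow_token_document",
                     deny_field="deny_token_document"):
    """Build a Solr filter query string from a user's access tokens.

    A document matches if it allows at least one of the tokens and
    denies none of them. Field names are assumptions for illustration.
    """
    if not tokens:
        # No tokens: match nothing rather than everything.
        return "-*:*"

    def quote(token):
        # Escape backslashes and double quotes inside a quoted phrase.
        escaped = token.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'

    clause = " OR ".join(quote(t) for t in tokens)
    return "+%s:(%s) -%s:(%s)" % (allow_field, clause, deny_field, clause)


if __name__ == "__main__":
    # Hypothetical token for the logged-in user.
    print(build_acl_filter(["S-1-5-32-544"]))
```

The resulting string would be passed as an fq parameter alongside the
user's query, so the security filter stays separate from relevance
scoring.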

One caveat here is that the file system I'm using is not a plain vanilla
FS. It's StorHouse / RFS from FileTek.

So, as I move along, I'll post my findings and ask for suggestions.

I already got your book, and can't wait to read the connector creation
chapters!

Thanks,

Kadri

On Thu, Apr 21, 2011 at 5:58 AM, Karl Wright <da...@gmail.com> wrote:

> Thanks for doing this.
>
> If you have suggestions as to how to modify the default behavior of
> the Solr output connector given the recent release of Solr 3.1, please
> consider creating a ticket in Apache JIRA that describes what you
> think needs to happen.  The output connector was designed to work with
> the example configuration of Solr by default; I believe it would be
> good to retain that ability.
>
> Karl

Re: How to Extract Document Content into Solr Using Manifold

Posted by Karl Wright <da...@gmail.com>.

Thanks for doing this.

If you have suggestions as to how to modify the default behavior of
the Solr output connector given the recent release of Solr 3.1, please
consider creating a ticket in Apache JIRA that describes what you
think needs to happen.  The output connector was designed to work with
the example configuration of Solr by default; I believe it would be
good to retain that ability.

Karl

On Wed, Apr 20, 2011 at 6:49 PM, Kadri Atalay <at...@gmail.com> wrote:
> I added the following field mapping to the ManifoldCF job, and now it's
> indexing the document content as well!
>
> (fmap.content    attr_content)
>
> Thanks!

Re: How to Extract Document Content into Solr Using Manifold

Posted by Kadri Atalay <at...@gmail.com>.

I added the following field mapping to the ManifoldCF job, and now it's
indexing the document content as well!

(fmap.content    attr_content)

Thanks!
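
For comparison with a manual update/extract call, the same mapping can be
written as Solr Cell request parameters. The sketch below only builds the
URL and sends nothing. The host, port, document id, and the helper name
build_extract_url are assumptions for illustration, and it presumes the
schema accepts a field named attr_content.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, field_map):
    """Build a Solr Cell (/update/extract) URL with fmap.* field mappings.

    base_url and doc_id are placeholders; field_map maps an extracted
    field name (e.g. "content") to the Solr field that should receive it.
    """
    params = [("literal.id", doc_id)]
    # Each mapping becomes an fmap.<source>=<target> request parameter.
    for source, target in sorted(field_map.items()):
        params.append(("fmap." + source, target))
    params.append(("commit", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local Solr instance and document id.
    print(build_extract_url("http://localhost:8983/solr", "doc1",
                            {"content": "attr_content"}))
```

The file itself would then be sent as the body of a POST to that URL,
which is what the ManifoldCF job's field mapping arranges on each
document's behalf.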


On Wed, Apr 20, 2011 at 6:36 PM, Karl Wright <da...@gmail.com> wrote:

> The content is posted to the update request handler.  It might be
> helpful if you turn on some logging in Solr to see exactly what is
> happening there.
>
> Karl

Re: How to Extract Document Content into Solr Using Manifold

Posted by Karl Wright <da...@gmail.com>.

The content is posted to the update request handler.  It might be
helpful if you turn on some logging in Solr to see exactly what is
happening there.

Karl

On Wed, Apr 20, 2011 at 6:18 PM, Kadri Atalay <at...@gmail.com> wrote:
> I'm able to use ManifoldCF and the SharedDrive connector to index files into
> Solr. However, the only fields I see in Solr are Author, Content_type, Name,
> and last_modified.
>
> Can anyone tell me how to index the content into Solr as well?
>
> Thanks in advance!
>
> Kadri
>
> PS. I'm using SolrCell (Tika), and a manual update/extract works fine.
>