Posted to user@manifoldcf.apache.org by Karl Wright <da...@gmail.com> on 2011/05/12 10:43:58 UTC
Re: How to Extract Document Content into Solr Using Manifold

I tried out your suggestions here on a freshly-installed Solr 3.1
instance.  Some observations:

(1) The /extract/tika handler does not exist out of the box; the
/update/extract handler still exists though.
(2) For the /update/extract handler, the fmap.content=attr_content
argument did not seem to be needed.

So it looks like, for a simple setup, the Solr output connector's
default values work just fine.  (I had a lot of trouble with Derby
queries running a long time, but that's a different issue.)
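
For reference, a request to the /update/extract handler with the field
mapping discussed in this thread could be built roughly as shown below.
This is only a sketch: the helper name build_extract_url, the base URL,
and the document id are made up for illustration, and it only constructs
the URL rather than posting a file.  literal.id, fmap.content, and commit
are ordinary Solr Cell request parameters.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, field_map=None, commit=True):
    """Build a Solr Cell (/update/extract) URL.

    base_url  -- hypothetical Solr root, e.g. "http://localhost:8983/solr"
    doc_id    -- value sent as literal.id (the document's unique key)
    field_map -- optional {tika_field: solr_field}; each entry becomes an
                 fmap.<tika_field>=<solr_field> argument
    commit    -- when True, append commit=true to the request
    """
    params = [("literal.id", doc_id)]
    for source_field, target_field in (field_map or {}).items():
        params.append(("fmap." + source_field, target_field))
    if commit:
        params.append(("commit", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # With the explicit mapping used earlier in this thread.
    print(build_extract_url("http://localhost:8983/solr", "doc1",
                            {"content": "attr_content"}))
    # Without any mapping, relying on the handler's configured defaults.
    print(build_extract_url("http://localhost:8983/solr", "doc1"))
```

The file itself would then be sent as the body of a POST to that URL,
for example with curl or any HTTP client.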

Karl

On Thu, Apr 21, 2011 at 10:30 AM, Kadri Atalay <at...@gmail.com> wrote:
> Sure Karl, no problem.
>
> My initial assumption was that when Solr is set up to use Tika (Solr Cell),
> content would be automatically extracted and indexed in Solr.
> But it looks like a field mapping needs to be defined in the ManifoldCF job.
>
> The goal of the project I'm working on is to:
>
> 1-use Solr with Tika (to extract and index MULTIPLE formats of documents),
> 2-use ManifoldCF (to use Active Directory security to pull user information
> from a domain controller and store the ACL for each indexed document),
> 3-perform secure searches on all the indexed documents based on the
> logged-in user's credentials.
>
> One caveat here is that the file system I'm using is not a plain vanilla
> FS. It's StorHouse / RFS from FileTek.
>
> So, as I move along, I'll post my findings, and ask for suggestions.
>
> I already got your book, and can't wait to read the connector creation
> chapters!
>
> Thanks,
>
> Kadri
>
>
> On Thu, Apr 21, 2011 at 5:58 AM, Karl Wright <da...@gmail.com> wrote:
>>
>> Thanks for doing this.
>>
>> If you have suggestions as to how to modify the default behavior of
>> the Solr output connector given the recent release of Solr 3.1, please
>> consider creating a ticket in Apache JIRA that describes what you
>> think needs to happen.  The output connector was designed to work with
>> the example configuration of Solr by default; I believe it would be
>> good to retain that ability.
>>
>> Karl
>>
>> On Wed, Apr 20, 2011 at 6:49 PM, Kadri Atalay <at...@gmail.com>
>> wrote:
>> > I added the following field mapping to the ManifoldCF job and now it's
>> > indexing the document content also!
>> >
>> > (fmap.content    attr_content)
>> >
>> > Thanks!
>> >
>> >
>> > On Wed, Apr 20, 2011 at 6:36 PM, Karl Wright <da...@gmail.com> wrote:
>> >>
>> >> The content is posted to the update request handler.  It might be
>> >> helpful if you turn on some logging in Solr to see exactly what is
>> >> happening there.
>> >>
>> >> Karl
>> >>
>> >> On Wed, Apr 20, 2011 at 6:18 PM, Kadri Atalay <at...@gmail.com>
>> >> wrote:
>> >> > I'm able to use ManifoldCF and the SharedDrive connector to index
>> >> > files into Solr.
>> >> > But the only information I see in Solr is Author, Content_type, Name,
>> >> > and last_modified.
>> >> >
>> >> > Can anyone tell me how to also index the content into Solr?
>> >> >
>> >> > Thanks in advance!
>> >> >
>> >> > Kadri
>> >> >
>> >> > PS. I'm using Solr Cell (Tika), and a manual update/extract is working
>> >> > fine.
>> >> >
>> >
>> >
>
>