Posted to user@manifoldcf.apache.org by Cihad Guzel <cg...@gmail.com> on 2016/10/31 10:08:59 UTC

Re: I don't see my file content in solr index

Hi Karl,

I tried the Solr output connector with Solr 4.4.0, 5.5.3 and 6.2.1.

With Solr 4.4.0, everything looks OK both in the Simple History and in the
Solr index.
With Solr 5.5.3 and 6.2.1, everything looks OK in the Simple History, but I
don't see my files' content in the Solr index. I can see the other fields
(mimetype, contenttype, etc.).

I traced the HTTP traffic with Wireshark and saw the content in the requests,
so the problem may be on the Solr side. But I am not sure whether ManifoldCF
supports Solr 5.x and 6.x. Do I need a different setting for Solr 5.x
and 6.x?

Thanks,
Cihad Guzel


2016-08-23 4:19 GMT+03:00 Karl Wright <da...@gmail.com>:

> Hi Cihad,
>
> When you say "the file is indexed successfully", what do you mean?  Do you
> mean that the Simple History shows a successful index attempt for the PDF
> file in question?  Or does it show that the document was rejected for some
> reason?
>
> If everything looks OK in the Simple History, then the problem has to be
> on the Solr side.  Please look at the Solr logs to see if the document was
> sent in, and what Solr did with it.
>
> Thanks,
> Karl
>
>
> On Mon, Aug 22, 2016 at 12:54 PM, Cihad Guzel <cg...@gmail.com> wrote:
>
>> Hi
>>
>> I am new to ManifoldCF. I have defined a job and run it. The file is
>> indexed successfully. All metadata is indexed, but I don't see the PDF file
>> content in the Solr index. What could be the reason?
>>
>> Thanks
>> Cihad Guzel
>>
>
>


-- 
Thanks
Cihad Güzel

Re: I don't see my file content in solr index

Posted by Shinichiro Abe <sh...@gmail.com>.
Hi,

Perhaps it is because of the Solr configuration?
If you use the data-driven schema and index via ExtractingRequestHandler
on the Solr side, you can change the fmap.content mapping to the name of a
stored field; then you can see the content in the Solr response. In the
out-of-the-box schema, fmap.content points to _text_, which is not a stored
field (but it is searchable). If you turn off ExtractingRequestHandler on
the ManifoldCF side, you have to set the content field name to a stored
field such as content_txt. You can't see the content in the response or
get highlighting snippets unless the content field is stored.
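
To illustrate the fmap.content remapping described above, here is a minimal
sketch that only builds the request URL for Solr's /update/extract handler.
The host, the collection name "mycollection", the document id "doc1" and the
helper name extract_url are assumptions made for the example; fmap.content,
literal.id and commit are the handler's own parameters. It does not show how
ManifoldCF itself sends the request.

```python
from urllib.parse import urlencode


def extract_url(base_url, collection, doc_id, content_field="content_txt"):
    """Build an /update/extract URL that maps the extracted body text
    to `content_field` instead of the handler's default target field.

    base_url, collection and doc_id are hypothetical values for illustration.
    """
    params = {
        "literal.id": doc_id,           # set the document id explicitly
        "fmap.content": content_field,  # send the extracted text to a stored field
        "commit": "true",               # make the document visible right away
    }
    return f"{base_url}/{collection}/update/extract?{urlencode(params)}"


url = extract_url("http://localhost:8983/solr", "mycollection", "doc1")
print(url)
```

The file itself would then be posted to that URL. Whether content_txt is
actually stored depends on the schema in use, so it is worth checking the
field definition before relying on it.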

Regards,
Shinichiro Abe

2016-10-31 20:11 GMT+09:00 Karl Wright <da...@gmail.com>:
> Hi Cihad,
>
> Do you have Solr logs?  I would have a look at those.  You should see posts
> to them.
>
> It's also possible that your testing methodology is flawed.  Because MCF is
> incremental, people sometimes forget that if you change output configuration
> (by pointing at a different solr index, for instance), MCF may not realize
> that your documents need reindexing.  There's a button you can click on the
> View page for the Solr connection that makes MCF forget what it did and
> reindex everything.  But if you see "document index" entries in the Simple
> History then that is not what the problem is.
>
> We recently updated SolrJ to the 5.x version, and will update it to the 6.x
> version as soon as an unwelcome dependency is removed, so that might be
> related to your issue, but Solr is usually pretty good about not silently
> accepting documents and just eating them without a fuss.  So I don't think
> that is the issue.  Have a look at the Solr logs.
>
> Karl
>
>

Re: I don't see my file content in solr index

Posted by Karl Wright <da...@gmail.com>.
Hi Cihad,

Do you have Solr logs?  I would have a look at those.  You should see posts
to them.

It's also possible that your testing methodology is flawed.  Because MCF is
incremental, people sometimes forget that if you change output
configuration (by pointing at a different solr index, for instance), MCF
may not realize that your documents need reindexing.  There's a button you
can click on the View page for the Solr connection that makes MCF forget
what it did and reindex everything.  But if you see "document index"
entries in the Simple History then that is not what the problem is.

We recently updated SolrJ to the 5.x version, and will update it to the
6.x version as soon as an unwelcome dependency is removed, so that might be
related to your issue, but Solr is usually pretty good about not silently
accepting documents and just eating them without a fuss.  So I don't think
that is the issue.  Have a look at the Solr logs.
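
Besides the logs, one quick check on the Solr side is to ask for the document
back and see which fields come out of the response. The sketch below is only
an illustration: the host, collection name, document id, field names and both
helper names are assumptions, and the sample response is hand-written rather
than taken from a real Solr instance.

```python
from urllib.parse import urlencode


def select_url(base_url, collection, doc_id, fields):
    """Build a /select URL that fetches one document and a chosen field list."""
    params = {
        "q": f'id:"{doc_id}"',   # look the document up by its id
        "fl": ",".join(fields),  # only stored fields can be returned
        "wt": "json",
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def missing_fields(response, fields):
    """Return the requested fields absent from the first returned document.

    A field that was indexed but is not stored will show up here.
    """
    docs = response.get("response", {}).get("docs", [])
    if not docs:
        return list(fields)
    return [name for name in fields if name not in docs[0]]


wanted = ["id", "mimetype", "content_txt"]
url = select_url("http://localhost:8983/solr", "mycollection", "doc1", wanted)

# Hand-written sample of a response where the content field is not returned.
sample = {"response": {"numFound": 1, "docs": [{"id": "doc1", "mimetype": "application/pdf"}]}}
print(missing_fields(sample, wanted))
```

If the content field is reported missing while the metadata fields are
present, the document did reach Solr and the field definition is the next
thing to look at.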

Karl

