Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/08/05 15:08:14 UTC
How to avoid splitting strings when indexing to solr
Hello people,
I was wondering how to prevent the content-type string from being split
into multiple values.
For example, if a document has the content-type "Application/pdf", it is
broken into three pieces, "Application/pdf", "Application", and "pdf", in the
Solr field "type".
I am not sure whether this is done by Nutch or whether it is an indexing issue in Solr.
I am sure someone knows the answer.
Thank you.
Re: How to avoid splitting strings when indexing to solr
Posted by Markus Jelsma <ma...@openindex.io>.
It is only in the nutch-default.xml of Nutch 1.3. If you upgraded and copied
over the 1.2 conf, you will indeed be missing it.
> On 07.08.2011 15:35, Markus Jelsma wrote:
> > <property>
> >   <name>moreIndexingFilter.indexMimeTypeParts</name>
> >   <value>true</value>
> >   <description>Determines whether the index-more plugin will split the
> >   mime-type in sub parts, this requires the type field to be multi valued.
> >   Set to true for backward compatibility. False will not split the mime-type.
> >   </description>
> > </property>
>
> Thank you very much Markus,
>
> I have copied this to my nutch-site.xml. It works very well now.
>
> But I did not have this option in my nutch-default.xml. Is there a standard
> way to find out which options I can pass to a plugin?
Re: How to avoid splitting strings when indexing to solr
Posted by Marek Bachmann <m....@uni-kassel.de>.
On 07.08.2011 15:35, Markus Jelsma wrote:
> <property>
>   <name>moreIndexingFilter.indexMimeTypeParts</name>
>   <value>true</value>
>   <description>Determines whether the index-more plugin will split the
>   mime-type in sub parts, this requires the type field to be multi valued.
>   Set to true for backward compatibility. False will not split the mime-type.
>   </description>
> </property>
>
Thank you very much Markus,
I have copied this to my nutch-site.xml. It works very well now.
But I did not have this option in my nutch-default.xml. Is there a standard
way to find out which options I can pass to a plugin?
Re: How to avoid splitting strings when indexing to solr
Posted by Markus Jelsma <ma...@openindex.io>.
<property>
  <name>moreIndexingFilter.indexMimeTypeParts</name>
  <value>true</value>
  <description>Determines whether the index-more plugin will split the
  mime-type in sub parts, this requires the type field to be multi valued.
  Set to true for backward compatibility. False will not split the mime-type.
  </description>
</property>
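
The property above shows the default value. To stop the splitting, it can presumably be overridden in conf/nutch-site.xml, which is where the follow-up in this thread reports placing it. The file below is only a minimal sketch: it assumes a Nutch 1.3 install, and the value false is inferred from the property description rather than quoted from anyone's actual configuration.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml; assumes no other local overrides are needed -->
<configuration>
  <!-- Overrides the nutch-default.xml setting so index-more keeps
       the MIME type as a single value instead of splitting it -->
  <property>
    <name>moreIndexingFilter.indexMimeTypeParts</name>
    <value>false</value>
  </property>
</configuration>
```

Documents indexed before the change would most likely need to be reindexed for the extra values to disappear from Solr.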
Re: How to avoid splitting strings when indexing to solr
Posted by Marek Bachmann <m....@uni-kassel.de>.
On 05.08.2011 18:16, Gora Mohanty wrote:
> Hi,
>
> Not too familiar these days with Nutch, but my guess is that a Solr
> analyser is getting applied. To have a field exactly as is, use the
> string fieldtype in Solr's schema.xml rather than the text fieldtype.
>
> Regards,
> Gora
Hi Gora,
Thank you for your answer. The field was already a string. The splitting
was done by Nutch's index-more plugin and passed to the type field, which
is multivalued.
But it is good to know for future use that the string type is not analysed
by Solr.
Thank you.
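
For readers who want to see the string versus text distinction in Solr's schema.xml, the fragment below is a rough sketch and not the schema used in this thread. The schema name and version, the fieldType names, and the analyzer chain are assumptions modelled on Solr's example schema; only the fact that "type" is a multivalued string field comes from the message above.

```xml
<?xml version="1.0"?>
<!-- Illustrative fragment only; name and version are placeholders -->
<schema name="example" version="1.3">
  <types>
    <!-- solr.StrField indexes the value verbatim, with no analysis -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <!-- solr.TextField runs an analyzer chain, so the tokenizer may split the value -->
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <!-- multiValued is required while index-more still emits the MIME type parts -->
    <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
  </fields>
</schema>
```

With a string field, any extra values seen in the index were added by the indexing client, here the index-more plugin, not by Solr.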
Re: How to avoid splitting strings when indexing to solr
Posted by Gora Mohanty <go...@mimirtech.com>.
Hi,
Not too familiar these days with Nutch, but my guess is that a Solr
analyser is getting applied. To have a field exactly as is, use the
string fieldtype in Solr's schema.xml rather than the text fieldtype.
Regards,
Gora