
Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/08/05 15:08:14 UTC

How to avoid splitting strings when indexing to solr

Hello people,

I was wondering how to keep the content-type string from being split
into multiple values.
For example, if a document has the content type "Application/pdf", it is
broken into three pieces, "Application/pdf", "Application", and "pdf", in
the Solr field "type".

I am not sure whether this is done by Nutch or whether it is an indexing matter in Solr.

I am sure someone knows the answer.

Thank you.

Re: How to avoid splitting strings when indexing to solr

Posted by Markus Jelsma <ma...@openindex.io>.

It is in the nutch-default.xml of 1.3 only. If you upgraded and copied over the 1.2
conf, you will indeed be missing it.

> On 07.08.2011 15:35, Markus Jelsma wrote:
> > <property>
> >   <name>moreIndexingFilter.indexMimeTypeParts</name>
> >   <value>true</value>
> >   <description>Determines whether the index-more plugin will split the mime-type
> >   in sub parts, this requires the type field to be multi valued. Set to true for backward
> >   compatibility. False will not split the mime-type.
> >   </description>
> > </property>
> 
> Thank you very much Markus,
> 
> I have copied this to my nutch-site.xml. It works very well now.
> 
> However, this option was not in my nutch-default.xml. Is there a standard
> way to find out which options I can pass to a plugin?

Re: How to avoid splitting strings when indexing to solr

Posted by Marek Bachmann <m....@uni-kassel.de>.

On 07.08.2011 15:35, Markus Jelsma wrote:
> <property>
>   <name>moreIndexingFilter.indexMimeTypeParts</name>
>   <value>true</value>
>   <description>Determines whether the index-more plugin will split the mime-type
>   in sub parts, this requires the type field to be multi valued. Set to true for backward
>   compatibility. False will not split the mime-type.
>   </description>
> </property>
>
Thank you very much Markus,

I have copied this to my nutch-site.xml. It works very well now.

However, this option was not in my nutch-default.xml. Is there a standard
way to find out which options I can pass to a plugin?



Re: How to avoid splitting strings when indexing to solr

Posted by Markus Jelsma <ma...@openindex.io>.

<property>
  <name>moreIndexingFilter.indexMimeTypeParts</name>
  <value>true</value>
  <description>Determines whether the index-more plugin will split the mime-type
  in sub parts, this requires the type field to be multi valued. Set to true for backward
  compatibility. False will not split the mime-type.
  </description>
</property>
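
The property as quoted carries the value true, which by its own description keeps the splitting. What follows is a minimal sketch of an override that turns the splitting off, assuming it goes into conf/nutch-site.xml and that the file uses the usual <configuration> wrapper. The property name and the meaning of false are taken from the description above; everything else is an assumption about a typical setup.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed location: conf/nutch-site.xml, which overrides nutch-default.xml. -->
  <!-- false: index-more passes the MIME type on as one unsplit value. -->
  <property>
    <name>moreIndexingFilter.indexMimeTypeParts</name>
    <value>false</value>
  </property>
</configuration>
```

Documents indexed before the change would presumably keep their split values until they are reindexed.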



Re: How to avoid splitting strings when indexing to solr

Posted by Marek Bachmann <m....@uni-kassel.de>.

On 05.08.2011 18:16, Gora Mohanty wrote:
> Hi,
>
> Not too familiar these days with Nutch, but my guess is that a Solr
> analyser is getting applied. To have a field exactly as is, use the
> String fieldtype in Solr's schema.xml rather than the text fieldtype.
>
> Regards,
> Gora

Hi Gora,

thank you for your answer. The field was already a String. The splitting
was done by the Nutch index-more plugin and passed to the type field, which
is multivalued.
But it is good to know for future use that the string type is not processed
by Solr.

Thank you.



Re: How to avoid splitting strings when indexing to solr

Posted by Gora Mohanty <go...@mimirtech.com>.

Hi,

Not too familiar these days with Nutch, but my guess is that a Solr
analyser is getting applied. To have a field exactly as is, use the
String fieldtype in Solr's schema.xml rather than the text fieldtype.

Regards,
Gora
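
The suggestion above might look as follows in Solr's schema.xml. This is only a sketch: it assumes the stock solr.StrField class and a field named type, as discussed in this thread. The attribute values are illustrative and not copied from the schema shipped with Nutch.

```xml
<!-- Goes inside the <types> section of schema.xml. -->
<!-- StrField values are indexed verbatim; no analyzer or tokenizer is applied. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<!-- Goes inside the <fields> section of schema.xml. -->
<!-- multiValued is only needed while the indexer sends several values for this field. -->
<field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
```

A string field only rules out splitting inside Solr. If the indexer itself sends several values, as the index-more plugin does here, each value is stored unchanged.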