Posted to user@nutch.apache.org by Markus Jelsma <ma...@buyways.nl> on 2010/09/08 11:27:18 UTC

Mime type via index-more plugin

Hi,

I'm testing the index-more plug-in and, to my surprise, its type field is
defined as a multi-valued field in the shipped Solr schema configuration!
Since when do files have more than one mime type?

Well, they don't! It seems the plug-in splits mime types on the slash and
exports three terms per document: 'text', 'plain' and 'text/plain'. In my
opinion, the plug-in should just export 'text/plain'. I can do the
tokenization myself, and I don't want my index to be polluted with this
non-information =)
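
For illustration, the splitting described above can be sketched roughly as
follows. This is not the plug-in's actual code: the class name MimeTypeParts,
the method terms and the splitParts flag are made up for this example, and the
order of the returned terms is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class MimeTypeParts {

    // Returns the values that would end up in the index for one mime type.
    // With splitParts == true this imitates the behaviour described above:
    // the full type plus the part before and the part after the slash.
    static List<String> terms(String mimeType, boolean splitParts) {
        List<String> terms = new ArrayList<>();
        terms.add(mimeType);
        if (splitParts) {
            int slash = mimeType.indexOf('/');
            // Only split when there is text on both sides of the slash.
            if (slash > 0 && slash < mimeType.length() - 1) {
                terms.add(mimeType.substring(0, slash));
                terms.add(mimeType.substring(slash + 1));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(terms("text/plain", true));  // [text/plain, text, plain]
        System.out.println(terms("text/plain", false)); // [text/plain]
    }
}
```

With splitParts set to false only the full type is produced, which is the
behaviour asked for here.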

Does anyone know how to configure the index-more plug-in? The wiki isn't very
helpful.

Cheers,

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Mime type via index-more plugin

Posted by Markus Jelsma <ma...@buyways.nl>.

Done for 1.2 and trunk. Hopefully not too late for inclusion in the 1.2
release. Please do test the trunk patch, as I haven't been able to yet.

On Wednesday 08 September 2010 13:12:49 Julien Nioche wrote:
> > Perhaps someone could give a pointer on how to read a configuration
> > setting for a plug-in and where to store the setting (Nutch config or
> > plugin.xml) and
> > i might actually write my first Java code again since four years!
> 
> You'd typically do that by adding something like
> 
>     conf.getBoolean("moreIndexingFilter.indexMimeTypeParts", true);
> 
> to the method setConf(Configuration conf)
> 
> then use the boolean accordingly in the method addType()
> 
> The value for moreIndexingFilter.indexMimeTypeParts can then be specified
> like any other Nutch param i.e. using nutch-site.xml or on the command line
> with -D
> 
> Please submit a patch for both 1.2 and the trunk
> 
> Thanks
> 
> Julien
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Mime type via index-more plugin

Posted by Julien Nioche <li...@gmail.com>.

> Perhaps someone could give a pointer on how to read a configuration setting
> for a plug-in and where to store the setting (Nutch config or plugin.xml)
> and
> i might actually write my first Java code again since four years!
>

You'd typically do that by adding something like

    conf.getBoolean("moreIndexingFilter.indexMimeTypeParts", true);

to the method setConf(Configuration conf)

then use the boolean accordingly in the method addType()

The value for moreIndexingFilter.indexMimeTypeParts can then be specified
like any other Nutch param, i.e. using nutch-site.xml or on the command line
with -D
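
A minimal sketch of that suggestion, assuming a boolean field on the filter.
To keep it runnable without Nutch or Hadoop on the classpath, Conf below is a
hypothetical stand-in for Hadoop's Configuration that only imitates
getBoolean(name, defaultValue); the class name ConfigurableFilterSketch and
the accessor indexesParts are likewise invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigurableFilterSketch {

    // Hypothetical stand-in for Hadoop's Configuration, reduced to the one
    // lookup needed here. It is NOT the real class.
    static class Conf {
        private final Map<String, String> props = new HashMap<>();

        void set(String name, String value) {
            props.put(name, value);
        }

        boolean getBoolean(String name, boolean defaultValue) {
            String value = props.get(name);
            if (value == null) {
                return defaultValue;
            }
            return Boolean.parseBoolean(value.trim());
        }
    }

    // Defaults to true so the existing behaviour is kept unless overridden.
    private boolean indexMimeTypeParts = true;

    // Read the setting once, when the configuration is handed to the filter.
    public void setConf(Conf conf) {
        indexMimeTypeParts =
            conf.getBoolean("moreIndexingFilter.indexMimeTypeParts", true);
    }

    // An addType()-style method would consult this flag before adding the
    // primary type and sub type to the document.
    boolean indexesParts() {
        return indexMimeTypeParts;
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        ConfigurableFilterSketch filter = new ConfigurableFilterSketch();

        filter.setConf(conf);
        System.out.println(filter.indexesParts()); // true (property not set)

        conf.set("moreIndexingFilter.indexMimeTypeParts", "false");
        filter.setConf(conf);
        System.out.println(filter.indexesParts()); // false
    }
}
```

Defaulting to true keeps the current behaviour for anyone who does not set the
property, which matches the backwards-compatible approach discussed in this
thread.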

Please submit a patch for both 1.2 and the trunk

Thanks

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Mime type via index-more plugin

Posted by Markus Jelsma <ma...@buyways.nl>.

Julien,

I've filed an issue [1], but I cannot, at this moment, provide a patch that
enables configuration of this feature. I did disable it in my checkout,
though.
Perhaps someone could give a pointer on how to read a configuration setting
for a plug-in and where to store the setting (Nutch config or plugin.xml), and
I might actually write my first Java code in four years!

[1]: https://issues.apache.org/jira/browse/NUTCH-901

Cheers,

On Wednesday 08 September 2010 11:57:10 Julien Nioche wrote:
> Hi Markus,
> 
> Your analysis is correct, see the comments in the MoreIndexingFilter
> 
>    * <p>
>    * Add Content-Type and its primaryType and subType add contentType,
>    * primaryType and subType to field "type" as un-stored, indexed and
>    * un-tokenized, so that search results can be confined by contentType or
>    * its primaryType or its subType.
>    * </p>
>    * <p>
>    * For example, if contentType is application/vnd.ms-powerpoint, search
>    * can be done with one of the following qualifiers
>    * type:application/vnd.ms-powerpoint type:application
>    * type:vnd.ms-powerpoint all case insensitive. The query filter is
>    * implemented in {@link TypeQueryFilter}.
>    * </p>
> 
> There is currently no way of configuring the behaviour of the
>  MoreIndexingFilter but doing so would be trivial and we could keep the
>  current behaviour by default i.e. add subparts and have schema with 
>  multiValued="true" but add a parameter to be able to index only the full
>  type. The schema would remain the same - you could of course modify it
>  locally if you wished to do so.
> 
> Would you like to open a JIRA and send a patch for this?
> 
> Thanks
> 
> Julien
> 
> > Hi,
> >
> > I'm testing the index-more plug-in but, to my surprise, it is defined as
> > a multi valued field in the shipped Solr schema configuration! Since when
> > do files have more than one mime type?
> >
> > Well, they don't! It seems the plug-in splits mime types by slash and
> >  exports three terms per document, 'text', 'plain' and 'text/plain'. But,
> >  in my opinion, the plug-in should just export 'text/plain', i can do the
> >  tokenization myself and i don't want my index to be polluted with this
> >  non- information =)
> >
> > Anyone knows how to configure the index-more plug-in? The wiki isn't very
> > helpful.
> >
> > Cheers,
> >
> > Markus Jelsma - Technisch Architect - Buyways BV
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Mime type via index-more plugin

Posted by Julien Nioche <li...@gmail.com>.

Hi Markus,

Your analysis is correct, see the comments in the MoreIndexingFilter

   * <p>
   * Add Content-Type and its primaryType and subType add contentType,
   * primaryType and subType to field "type" as un-stored, indexed and
   * un-tokenized, so that search results can be confined by contentType or
   * its primaryType or its subType.
   * </p>
   * <p>
   * For example, if contentType is application/vnd.ms-powerpoint, search
   * can be done with one of the following qualifiers
   * type:application/vnd.ms-powerpoint type:application
   * type:vnd.ms-powerpoint all case insensitive. The query filter is
   * implemented in {@link TypeQueryFilter}.
   * </p>

There is currently no way of configuring the behaviour of the
MoreIndexingFilter, but doing so would be trivial. We could keep the current
behaviour by default, i.e. add the subparts and ship a schema with
multiValued="true", and add a parameter to index only the full type. The
schema would remain the same; you could of course modify it locally if you
wished to do so.

Would you like to open a JIRA and send a patch for this?

Thanks

Julien
-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 8 September 2010 10:49, Markus Jelsma <ma...@buyways.nl> wrote:

> I've checked the MoreIndexingFilter sources and my suspicions were right: it
> really splits the input in the getParts method. I'd love to have this removed
> and committed, but I guess more work is needed to keep it compatible, such as
> tokenizing the field to keep it searchable, which would require a schema
> change and all kinds of trouble, wouldn't it?
>
>
> On Wednesday 08 September 2010 11:27:18 Markus Jelsma wrote:
> > Hi,
> >
> > I'm testing the index-more plug-in but, to my surprise, it is defined as a
> > multi valued field in the shipped Solr schema configuration! Since when do
> > files have more than one mime type?
> >
> > Well, they don't! It seems the plug-in splits mime types by slash and
> >  exports three terms per document, 'text', 'plain' and 'text/plain'. But,
> >  in my opinion, the plug-in should just export 'text/plain', i can do the
> >  tokenization myself and i don't want my index to be polluted with this
> >  non- information =)
> >
> > Anyone knows how to configure the index-more plug-in? The wiki isn't very
> > helpful.
> >
> > Cheers,
> >
> > Markus Jelsma - Technisch Architect - Buyways BV
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> >

Re: Mime type via index-more plugin

Posted by Markus Jelsma <ma...@buyways.nl>.

I've checked the MoreIndexingFilter sources and my suspicions were right: it
really splits the input in the getParts method. I'd love to have this removed
and committed, but I guess more work is needed to keep it compatible, such as
tokenizing the field to keep it searchable, which would require a schema
change and all kinds of trouble, wouldn't it?


On Wednesday 08 September 2010 11:27:18 Markus Jelsma wrote:
> Hi,
> 
> I'm testing the index-more plug-in but, to my surprise, it is defined as a
> multi valued field in the shipped Solr schema configuration! Since when do
> files have more than one mime type?
> 
> Well, they don't! It seems the plug-in splits mime types by slash and
>  exports three terms per document, 'text', 'plain' and 'text/plain'. But,
>  in my opinion, the plug-in should just export 'text/plain', i can do the
>  tokenization myself and i don't want my index to be polluted with this
>  non- information =)
> 
> Anyone knows how to configure the index-more plug-in? The wiki isn't very
> helpful.
> 
> Cheers,
> 
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Mime type via index-more plugin

Posted by Markus Jelsma <ma...@buyways.nl>.

I'll try to give it a shot this week for the 1.2 branch and trunk if it isn't 
too different. It shouldn't be too hard and Julien's explanation on how to 
read the configuration makes a lot of sense.


On Wednesday 08 September 2010 16:37:29 Mattmann, Chris A (388J) wrote:
> Hi Markus,
> 
> > Interesting! But can the mime extractor return more than one type for a
> > given file in Nutch?
> 
> Sure, Nutch metadata is a named Field->multi-value structure so a file (or
> piece of content) can certainly have more than 1 type.
> 
> > I see, but in that case it would be helpful if the canonical, top and sub
> > types have their own field which would also give more meaning to the
> > whole. The way it works now results in a real nasty mess when faceting on
> > the type field.
> 
> I hear ya! Though I guess it's a mess from your perspective. From mine, it
> is nice to be able to see things like:
> 
> Mime Type:
>   text (720)
>   plain (77)
>   text/plain (250)
>   xml (235)
> ...
> 
> Faceting using the primary and sub types works fine for me.
> 
> > What would be a good (configurable) improvement? Just adding the option
> > to disable the split? Or also add an option that spits out up to three
> > distinct fields?
> 
> I think that both of your suggestions are great improvements and we can
> include a patch to make each configurable.
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: Chris.Mattmann@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Mime type via index-more plugin

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hi Markus,

> Interesting! But can the mime extractor return more than one type for a given
> file in Nutch?

Sure, Nutch metadata is a named Field->multi-value structure so a file (or
piece of content) can certainly have more than 1 type.
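
As a rough picture of what a named field with multiple values means, here is
a toy sketch. It is not Nutch's metadata class: MultiValuedMetadataSketch and
its add and getValues methods are hypothetical names chosen for the example,
and using "type" as the field name is only an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValuedMetadataSketch {

    // Each field name maps to a list of values rather than a single value.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Appends a value to the named field, creating the field if needed.
    void add(String name, String value) {
        values.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
    }

    // Returns all values of the named field, or an empty list if unset.
    List<String> getValues(String name) {
        List<String> found = values.get(name);
        return found == null ? Collections.<String>emptyList() : found;
    }

    public static void main(String[] args) {
        MultiValuedMetadataSketch meta = new MultiValuedMetadataSketch();
        meta.add("type", "text/plain");
        meta.add("type", "text");
        meta.add("type", "plain");
        System.out.println(meta.getValues("type"));  // [text/plain, text, plain]
        System.out.println(meta.getValues("title")); // []
    }
}
```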

> I see, but in that case it would be helpful if the canonical, top and sub
> types have their own field which would also give more meaning to the whole.
> The way it works now results in a real nasty mess when faceting on the type
> field.

I hear ya! Though I guess it's a mess from your perspective. From mine, it
is nice to be able to see things like:

Mime Type:
  text (720)
  plain (77)
  text/plain (250)
  xml (235)
...

Faceting using the primary and sub types works fine for me.

> 
> What would be a good (configurable) improvement? Just adding the option to
> disable the split? Or also add an option that spits out up to three distinct
> fields?

I think that both of your suggestions are great improvements and we can
include a patch to make each configurable.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: Mime type via index-more plugin

Posted by Markus Jelsma <ma...@buyways.nl>.

Hello Chris,

On Wednesday 08 September 2010 16:17:30 Mattmann, Chris A (388J) wrote:
> Hi Markus,
> 
> In fact, there are plenty of times that files have > 1 mime type. There is
>  an entire classification scheme from IANA that defines parent-child
>  relationships between mime type (such as the notion that text/xml is a
>  descendant of text/plain).

Interesting! But can the mime extractor return more than one type for a given 
file in Nutch?

> 
> The current index-more plugin splits up mime types into the top level type
>  (e.g., "text"), its sub-type (e.g., "plain"), and its canonical type
>  ("text/plain"), as intended.

I see, but in that case it would be helpful if the canonical, top-level and
sub types each had their own field, which would also give more meaning to the
whole. The way it works now results in a really nasty mess when faceting on
the type field.

What would be a good (configurable) improvement? Just adding the option to
disable the split? Or also adding an option that spits out up to three
distinct fields?
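
To make the second option concrete, here is a small sketch of what up to three
distinct fields could look like. Nothing here comes from the plug-in itself:
the class SeparateTypeFields and the field names "type", "primaryType" and
"subType" are assumptions for illustration (the latter two borrow the wording
of the MoreIndexingFilter comments).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SeparateTypeFields {

    // Maps one mime type to single-valued fields instead of putting all
    // three terms into the same multi-valued field.
    static Map<String, String> fields(String mimeType) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("type", mimeType);
        int slash = mimeType.indexOf('/');
        // The extra fields are only added when both halves are present.
        if (slash > 0 && slash < mimeType.length() - 1) {
            fields.put("primaryType", mimeType.substring(0, slash));
            fields.put("subType", mimeType.substring(slash + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        // {type=text/plain, primaryType=text, subType=plain}
        System.out.println(fields("text/plain"));
    }
}
```

Faceting on each field separately would then avoid mixing counts for 'text',
'plain' and 'text/plain' in one list.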


Cheers
M.

> 
> Cheers,
> Chris
> 
> 
> 
> 
> On 9/8/10 2:27 AM, "Markus Jelsma" <ma...@buyways.nl> wrote:
> 
> Hi,
> 
> I'm testing the index-more plug-in but, to my surprise, it is defined as a
> multi valued field in the shipped Solr schema configuration! Since when do
> files have more than one mime type?
> 
> Well, they don't! It seems the plug-in splits mime types by slash and
>  exports three terms per document, 'text', 'plain' and 'text/plain'. But,
>  in my opinion, the plug-in should just export 'text/plain', i can do the
>  tokenization myself and i don't want my index to be polluted with this
>  non- information =)
> 
> Anyone knows how to configure the index-more plug-in? The wiki isn't very
> helpful.
> 
> Cheers,
> 
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
> 
> 
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: Chris.Mattmann@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Mime type via index-more plugin

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hi Markus,

In fact, there are plenty of times that files have > 1 mime type. There is an
entire classification scheme from IANA that defines parent-child relationships
between mime types (such as the notion that text/xml is a descendant of
text/plain).

The current index-more plugin splits up mime types into the top-level type
(e.g., "text"), its sub-type (e.g., "plain"), and its canonical type
("text/plain"), as intended.

Cheers,
Chris

On 9/8/10 2:27 AM, "Markus Jelsma" <ma...@buyways.nl> wrote:

> Hi,
>
> I'm testing the index-more plug-in but, to my surprise, it is defined as a
> multi valued field in the shipped Solr schema configuration! Since when do
> files have more than one mime type?
>
> Well, they don't! It seems the plug-in splits mime types by slash and exports
> three terms per document, 'text', 'plain' and 'text/plain'. But, in my
> opinion, the plug-in should just export 'text/plain', i can do the
> tokenization myself and i don't want my index to be polluted with this
> non-information =)
>
> Anyone knows how to configure the index-more plug-in? The wiki isn't very
> helpful.
>
> Cheers,
>
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++