Posted to dev@nutch.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2007/10/09 02:26:50 UTC

[jira] Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework

     [ https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann closed NUTCH-562.
-----------------------------------


- Patch applied to trunk in r583016

> Port mime type framework to use Tika mime detection framework
> -------------------------------------------------------------
>
>                 Key: NUTCH-562
>                 URL: https://issues.apache.org/jira/browse/NUTCH-562
>             Project: Nutch
>          Issue Type: Improvement
>          Components: mime_type_detector
>    Affects Versions: 1.0.0
>         Environment: Mac Book Pro, Intel Core Duo 2.0 Ghz, 2.0 GB RAM, Mac OS X 10.4 although improvement is indep of env
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: NUTCH-562.Mattmann.patch.txt, tika-0.1-dev.jar
>
>
> With Tika (http://incubator.apache.org/tika/) nearing a stable 0.1 release candidate, I think it would be a good time to patch Nutch to use Tika's mime detection system (an improvement over the existing Nutch one written primarily by Jerome). Tika's mime system is based on the mime system from Freedesktop.org and includes several improvements over the existing Nutch mime system such as:
> 1. reliable XML-based content detection (a clear issue plaguing Nutch for some time now), ability to delineate between RSS, XML, ATOM, etc.
> 2. mime magic pattern matching, including support for multiple patterns
> 3. glob pattern matches (ability to support > 1)
> I'll get together a patch and then attach it to the list once it's relatively stable.
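
For readers unfamiliar with the three techniques listed above, here is a minimal, self-contained sketch of how glob matching, magic-byte matching, and XML root-element inspection can be combined. It is NOT Tika's or Nutch's actual API: the class name MimeSketch, the method detect, and the small lookup tables are all hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a toy detector, not the Tika or Nutch implementation.
class MimeSketch {
    // Hypothetical glob table: file-name suffix -> MIME type (several suffixes may share a type).
    static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // Hypothetical magic table: leading bytes of the content -> MIME type.
    static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // Hypothetical root-element table: XML root element name -> more specific MIME type.
    static final Map<String, String> XML_ROOTS = new LinkedHashMap<>();

    static {
        GLOBS.put(".pdf", "application/pdf");
        GLOBS.put(".htm", "text/html");
        GLOBS.put(".html", "text/html");
        GLOBS.put(".xml", "application/xml");
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("<?xml", "application/xml");
        XML_ROOTS.put("rss", "application/rss+xml");
        XML_ROOTS.put("feed", "application/atom+xml");
    }

    // Finds the first element start tag; "<?xml ...?>" and "<!...>" are skipped
    // because '?' and '!' cannot start an element name. An optional namespace
    // prefix is dropped so that group(1) is the local name.
    static final Pattern ROOT =
        Pattern.compile("<(?:[A-Za-z_][\\w.-]*:)?([A-Za-z_][\\w.-]*)[\\s>/]");

    static String detect(String name, byte[] content) {
        // Only the first bytes are inspected; ISO-8859-1 maps bytes to chars one to one.
        String head = new String(content, 0, Math.min(content.length, 1024),
                                 StandardCharsets.ISO_8859_1);
        String type = null;

        // 1. Magic patterns: compare the start of the content with known signatures.
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (head.startsWith(e.getKey())) { type = e.getValue(); break; }
        }
        // 2. Glob patterns: fall back to the file-name suffix.
        if (type == null) {
            String lower = name.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                if (lower.endsWith(e.getKey())) { type = e.getValue(); break; }
            }
        }
        // 3. XML refinement: use the root element to tell RSS and Atom from generic XML.
        if ("application/xml".equals(type)) {
            Matcher m = ROOT.matcher(head);
            if (m.find()) {
                type = XML_ROOTS.getOrDefault(m.group(1), type);
            }
        }
        return type == null ? "application/octet-stream" : type;
    }

    public static void main(String[] args) {
        byte[] rss = "<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect("feed.xml", rss));
    }
}
```

The point of step 3 is that a name like feed.xml and a leading "<?xml" both say only "XML"; distinguishing RSS from Atom needs a look at the root element. A real detector would use a proper XML parser rather than a regular expression.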



Re: [jira] Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework

Posted by Dennis Kubes <ku...@apache.org>.
What bothers me here is not the time to commit, although I agree it 
probably should have been longer than 1 day, but that AFAIK there is 
very little documentation about Tika.  That being said, both Chris and 
Sami are committers for Tika.  So if they both feel that Tika is mature 
enough to use, and can help answer the inevitable question on the Nutch 
list about it, then I feel it would be okay to keep the changes.

Dennis Kubes



Andrzej Bialecki wrote:
> Chris A. Mattmann (JIRA) wrote:
>>      [ 
>> https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel 
>> ]
>>
>> Chris A. Mattmann closed NUTCH-562.
>> -----------------------------------
>>
>>
>> - Patch applied to trunk in r583016
> 
> I think this issue didn't get enough attention before it was committed. 
> I agree with the direction of this patch - functionality-wise the mime 
> type detector in Tika is clearly superior to the one that we have now in 
> Nutch - but I feel that the use of an external framework, which is not 
> yet released, should be discussed first, and the proper working of the 
> patch should be confirmed by other users. There was too little time to 
> do this before the commit.
> 
> I vote for reverting this patch, unless there is an overall consensus 
> among Nutch developers that it's ok to keep it as it is - on one hand 
> considering the added functionality and simplification of Nutch code, 
> and on the other hand considering the (lack of) maturity of Tika.
> 

Re: [jira] Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Guys,

>> 
>> I vote for reverting this patch, unless there is an overall consensus
>> among Nutch developers that it's ok to keep it as it is - on one hand
>> considering the added functionality and simplification of Nutch code,
>> and on the other hand considering the (lack of) maturity of Tika.
> 
> I agree with Andrzej here. I would have waited a bit more before rushing
> into this. Because at this point (where no Tika releases have been made)
> it might (even though it does not look like it right now) even be
> possible that the project will be retired without any releases at all.

I'm not out for beating a dead horse here, but the thought comes to mind:
what about the vitality of the code as it exists within the Nutch code base?
When was the last time anybody at all worked on the mime system? It was
pioneered by Jerome, but he's been largely inactive as a committer for more
than a year now, and it doesn't look like that's going to change.

I ported what was largely Nutch's mime system, with Jerome's improvements, to
Tika, where the code is actively being developed by me (and vetted by the
other *active* members of the team) -- in contrast to Nutch. As a developer,
I don't want to maintain the code in both places, but I'm willing to
maintain the Nutch use of and interface to Tika, which means that Nutch will
inherit the benefits using this approach. Being a member of the Nutch
community for almost 2 years now, I can't tell you how many times people
have asked for Nutch to be able to reliably detect XML content. This is
reified in the form of a number of different JIRA issues that reference that
deficiency, that are for all intents and purposes, not being worked on at
all. 

I'm all for following the process, and so forth, but at the same time, I
think the Nutch community needs to take a serious look at itself with
regards to the "sacred" nature of the trunk, which we currently treat with a
large amount of sensitivity, etc. However, the trunk as it stands on other
projects (and of course, I'm biased, but I use my work as an example and also
say something like Tika), the trunk is not something that is expected to be
"always working" and is regularly expected as somewhere where bugs can
exist, and where they can be fixed before a release is made. That's not the
way I feel on this project and quite honestly I think it stymies progress.

Finally, there is precedent for what I did with the Tika patch making
its way into Nutch. If I recall, something very similar happened when
Hadoop came along and NDFS (at the time as it was called) and MapReduce made
their way into an external library, and Nutch was made to rely on that (at
the time) in-development library. This makes sense, because the folks
working on Hadoop were actively working on updates to the portion of the
code that Nutch relied upon, and all the developers that were interested in
that portion of the code started developing in that arena. I'm not
comparing Hadoop to Tika, but certainly there are some similarities here.

-Chris


______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: [jira] Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework

Posted by Sami Siren <ss...@gmail.com>.
Andrzej Bialecki wrote:
> Chris A. Mattmann (JIRA) wrote:
>>      [
>> https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> ]
>>
>> Chris A. Mattmann closed NUTCH-562.
>> -----------------------------------
>>
>>
>> - Patch applied to trunk in r583016
> 
> I think this issue didn't get enough attention before it was committed.
> I agree with the direction of this patch - functionality-wise the mime
> type detector in Tika is clearly superior to the one that we have now in
> Nutch - but I feel that the use of an external framework, which is not
> yet released, should be discussed first, and the proper working of the
> patch should be confirmed by other users. There was too little time to
> do this before the commit.
> 
> I vote for reverting this patch, unless there is an overall consensus
> among Nutch developers that it's ok to keep it as it is - on one hand
> considering the added functionality and simplification of Nutch code,
> and on the other hand considering the (lack of) maturity of Tika.

I agree with Andrzej here. I would have waited a bit more before rushing
into this. Because at this point (where no Tika releases have been made)
it might (even though it does not look like it right now) even be
possible that the project will be retired without any releases at all.

-- 
 Sami Siren

Re: [jira] Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Folks,

 Either way is fine with me. I committed the patch for the following
reasons:

 1. Though the patch sat for around 36 hrs, the JIRA issue has been around
nearly 2 weeks, without any comment at all. I used this as a baseline for
relative interest in the patch. Though a patch file is ultimately the means
by which contributions are to be judged, I had pretty much laid out the
plan in the JIRA issue: port Nutch to use Tika mime system. Tika mime system
provides X, Y, Z that Nutch doesn't, etc. This described the ultimate intent
of the code that was soon to be reified.

 2. Similarity of Tika mime API to existing Nutch mime API. The core classes
of the API in both Tika and the old mime system in Nutch are 90% the same
(in some cases, like MimeTypes.java, the file is nearly identical). This fact
is not incidental: it's because Jerome wrote the majority of both code
bases. This made it easier for me to swallow that the API would work as
expected.

 3. My experience testing the patch in the case of small crawls against
subsets of the apache.org sites. I was primarily looking for 2 things:

  a. performance -- there wasn't a significant hit that I could notice while
observing crawl time anecdotally.

  b. effectiveness -- were mime types still being set in the metadata, were
the right parsers getting called, etc.? The answer here was "yes".

 I'm sure that this is more of a procedural issue than anything else.
Because of this I'm happy to revert the patch. My +1 for it in fact. Then
I'll happily await other folks to test it and provide feedback. I can't
promise I'll get to updating it and committing revised versions of it back
to the sources right away though: the rest of my week is actually very busy
(another reason for my desire to contribute the patch and commit it over the
past weekend -- it was the only time in the next week or so that I would
have to get it into the sources and to solve some issues that have been
plaguing Nutch for a while, e.g., reliable content type detection in the
case of XML/RDF/RSS files, etc.).

 In any case, let me know what you decide.

Chris


  


On 10/9/07 1:57 PM, "Andrzej Bialecki" <ab...@getopt.org> wrote:

> Chris A. Mattmann (JIRA) wrote:
>>      [ 
>> https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>> 
>> Chris A. Mattmann closed NUTCH-562.
>> -----------------------------------
>> 
>> 
>> - Patch applied to trunk in r583016
> 
> I think this issue didn't get enough attention before it was committed.
> I agree with the direction of this patch - functionality-wise the mime
> type detector in Tika is clearly superior to the one that we have now in
> Nutch - but I feel that the use of an external framework, which is not
> yet released, should be discussed first, and the proper working of the
> patch should be confirmed by other users. There was too little time to
> do this before the commit.
> 
> I vote for reverting this patch, unless there is an overall consensus
> among Nutch developers that it's ok to keep it as it is - on one hand
> considering the added functionality and simplification of Nutch code,
> and on the other hand considering the (lack of) maturity of Tika.




Re: [jira] Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework

Posted by Andrzej Bialecki <ab...@getopt.org>.
Chris A. Mattmann (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Chris A. Mattmann closed NUTCH-562.
> -----------------------------------
> 
> 
> - Patch applied to trunk in r583016

I think this issue didn't get enough attention before it was committed. 
I agree with the direction of this patch - functionality-wise the mime 
type detector in Tika is clearly superior to the one that we have now in 
Nutch - but I feel that the use of an external framework, which is not 
yet released, should be discussed first, and the proper working of the 
patch should be confirmed by other users. There was too little time to 
do this before the commit.

I vote for reverting this patch, unless there is an overall consensus 
among Nutch developers that it's ok to keep it as it is - on one hand 
considering the added functionality and simplification of Nutch code, 
and on the other hand considering the (lack of) maturity of Tika.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com