Posted to dev@tika.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2010/05/25 23:13:31 UTC

[jira] Created: (TIKA-433) Tika + Hadoop

Tika + Hadoop
-------------

                 Key: TIKA-433
                 URL: https://issues.apache.org/jira/browse/TIKA-433
             Project: Tika
          Issue Type: New Feature
          Components: general
            Reporter: Grant Ingersoll
            Priority: Minor


Would be great to have a Tika contrib that took in an HDFS location with "rich" documents on it and an output format (or output processor) and converted the docs to XHTML or Solr or whatever.  Seems like it should be pretty straightforward to do on the Hadoop side of things.  Only tricky part, I suppose, is the output format and how to make that pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-433) Tika + Hadoop

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871544#action_12871544 ] 

Julien Nioche commented on TIKA-433:
------------------------------------

You can do that with [Behemoth|http://code.google.com/p/behemoth-pebble/] as it uses Tika on rich documents stored in a SequenceFile. There is an application in the Behemoth Sandbox which sends the annotated documents to SOLR and I am planning to write one to generate vectors for Mahout. The output format is a very straightforward standoff annotation model and that should fit for most applications.

[jira] Commented: (TIKA-433) Tika + Hadoop

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871720#action_12871720 ] 

Grant Ingersoll commented on TIKA-433:
--------------------------------------

I think it makes sense as a Tika contrib, but that's not for me to determine.  It seems like it is more generally useful than just Behemoth or Mahout and fits well with what Tika does, along the lines of Tika's command line tool.  I don't have a use for Behemoth and don't wish to inject it into my dependency chain, whereas I am already using Tika and Hadoop.

[jira] Commented: (TIKA-433) Tika + Hadoop

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871623#action_12871623 ] 

Julien Nioche commented on TIKA-433:
------------------------------------

Could do. I can't see a place in Tika's code for non-core contributions or a sandbox, though, and I am not sure that we want to burden Tika with Hadoop dependencies just for the sake of implementing this. My comment was more about the fact that functionality such as what you described *is* what Behemoth is all about, i.e. processing documents in various ways using MapReduce, storing the data in a neutral, stand-off based implementation, and using that in conjunction with projects such as SOLR or Mahout.
I suppose it also depends on whether Tika's focus should be on its API alone or whether it should provide a sandbox as well. WDYT?

[jira] Commented: (TIKA-433) Tika + Hadoop

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871726#action_12871726 ] 

Jukka Zitting commented on TIKA-433:
------------------------------------

We could easily add a separate tika-hadoop component for this.

[jira] Commented: (TIKA-433) Tika + Hadoop

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917390#action_12917390 ] 

Grant Ingersoll commented on TIKA-433:
--------------------------------------

I've taken this offline and am going to put it up as a separate project.  It's pretty close to being usable, but I am still doing some testing.  I'll post a link when it is ready.

[jira] Commented: (TIKA-433) Tika + Hadoop

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917384#action_12917384 ] 

Jukka Zitting commented on TIKA-433:
------------------------------------

Anyone working on a patch for this? If not, I'd be inclined to resolve this as Won't Fix. It sounds like such a component would require more Hadoop than Tika expertise, and would therefore be better maintained within Hadoop or as a separate external project.

[jira] Commented: (TIKA-433) Tika + Hadoop

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871742#action_12871742 ] 

Yonik Seeley commented on TIKA-433:
-----------------------------------

From the peanut gallery, Lucene has gone down the contrib path in the past, and I wouldn't recommend it.  There are tons of places to host projects these days, and it may make more sense to be hosted as a separate project.


[jira] Commented: (TIKA-433) Tika + Hadoop

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917386#action_12917386 ] 

Chris A. Mattmann commented on TIKA-433:
----------------------------------------

+1, Jukka.

[jira] Commented: (TIKA-433) Tika + Hadoop

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871616#action_12871616 ] 

Grant Ingersoll commented on TIKA-433:
--------------------------------------

Does that mean you are going to extract it from Behemoth and contribute it here?  It seems like it would be a generally useful thing to do.

[jira] Commented: (TIKA-433) Tika + Hadoop

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917391#action_12917391 ] 

Chris A. Mattmann commented on TIKA-433:
----------------------------------------

Thanks Grant, sounds cool!
