
Posted to solr-dev@lucene.apache.org by "Eric Pugh (JIRA)" <ji...@apache.org> on 2007/07/02 22:54:04 UTC

[jira] Created: (SOLR-284) Parsing Rich Document Types

Parsing Rich Document Types
---------------------------

                 Key: SOLR-284
                 URL: https://issues.apache.org/jira/browse/SOLR-284
             Project: Solr
          Issue Type: New Feature
          Components: update
    Affects Versions: 1.3
            Reporter: Eric Pugh
             Fix For: 1.3


I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, PowerPoint, or Excel document into Solr.

I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.

 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by Chris Hostetter <ho...@fucit.org>.

: I got the following error when try to launch solr in tomcat after applying
: patch SOLR-284

: java.lang.RuntimeException: org.apache.lucene.index.CorruptIndexException:
: Unknown format version: -4 at org.apache.solr.core.SolrCore.getSearcher(

that error indicates that the version of lucene you are using is incapable 
of opening indexes created by the version of lucene that created your 
index.

specifically: i'm pretty sure -4 is the format version for Lucene 2.3.X

did you by any chance apply the patch to an old version of Solr that was 
using Lucene 2.2 and then run it against an index you'd already made with 
a newer version of Solr?


-Hoss
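
The version mismatch described above can be pictured with a small sketch. This is a simplified model, not Lucene's actual code: the class, the exception type, and the assumption that the older reader's newest known format is -3 are all illustrative. It only relies on the idea from the message that format numbers are negative and that a reader rejects a number newer than the ones it knows.

```java
public class FormatVersionCheck {

    /** Hypothetical stand-in for Lucene's CorruptIndexException (unchecked here for brevity). */
    public static class UnknownFormatException extends RuntimeException {
        public UnknownFormatException(String message) {
            super(message);
        }
    }

    /**
     * Simplified model of the reader-side check. Format numbers are negative and
     * decrease as formats get newer, so a value below the newest format this
     * reader knows about must have been written by a newer library.
     */
    public static void checkFormat(int formatInIndex, int newestFormatKnownToReader) {
        if (formatInIndex < newestFormatKnownToReader) {
            throw new UnknownFormatException("Unknown format version: " + formatInIndex);
        }
    }

    public static void main(String[] args) {
        int olderReader = -3; // assumption: newest format understood by the older library

        checkFormat(-3, olderReader); // index written by the same generation: accepted
        System.out.println("format -3 accepted");

        try {
            checkFormat(-4, olderReader); // index written by a newer library
        } catch (UnknownFormatException e) {
            System.out.println(e.getMessage()); // prints "Unknown format version: -4"
        }
    }
}
```

If this model matches what happened, the index itself is not damaged; it would open again with a Solr build that bundles the newer Lucene.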

Re: [jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by Chris Hostetter <ho...@fucit.org>.

: This patch is failing when i try to index large documents of size
: >20MB(mainly excel and pdf)

What is the error?  What shows up in your logs?

What is the value of "multipartUploadLimitInKB" in your solrconfig.xml?



-Hoss
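
As a rough way to see why that setting matters for >20MB files, here is the arithmetic as a sketch. The 2048 figure is only an assumed example value, not something reported in this thread, and the class and method names are made up for illustration; substitute whatever your own solrconfig.xml contains.

```java
public class UploadLimitCheck {

    /** Returns true when a file of the given size is larger than a limit expressed in KB. */
    public static boolean exceedsLimit(long fileSizeBytes, long limitInKB) {
        return fileSizeBytes > limitInKB * 1024L;
    }

    public static void main(String[] args) {
        long twentyMegabytes = 20L * 1024L * 1024L;
        long assumedLimitInKB = 2048L; // assumption: replace with the value from your solrconfig.xml

        System.out.println(exceedsLimit(twentyMegabytes, assumedLimitInKB)); // prints "true"
        System.out.println(exceedsLimit(twentyMegabytes, 32768L));           // prints "false"
    }
}
```

A multipart request is somewhat larger than the file it carries, so a limit that only just covers the file size may still be too small.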

Re: [jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by pavan kumar donepudi <pa...@gmail.com>.

This patch is failing when i try to index large documents of size
>20MB(mainly excel and pdf)

Can anyone help?

Regards,
Pavan



On Wed, Mar 12, 2008 at 2:04 PM, pavan kumar donepudi <
pavan.donepudi@gmail.com> wrote:

> I got the following error when try to launch solr in tomcat after applying
> patch SOLR-284
>

Re: [jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by pavan kumar donepudi <pa...@gmail.com>.

I got the following error when try to launch solr in tomcat after applying
patch SOLR-284

*message* *Severe errors in solr configuration. Check your log files for
more detailed information on what may be wrong. If you want solr to continue
after configuration errors, change:
<abortOnConfigurationError>false</abortOnConfigurationError> in
solrconfig.xml
-------------------------------------------------------------
java.lang.RuntimeException: org.apache.lucene.index.CorruptIndexException: Unknown format version: -4
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:498)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:229)
    at org.apache.solr.core.SolrCore.getSolrCore(SolrCore.java:183)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:70)
    at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:274)
    at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:396)
    at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:107)
    at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3693)
    at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:760)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:740)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
    at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:626)
    at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553)
    at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488)
    at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1138)
    at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
    at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
    at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1022)
    at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
    at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1014)
    at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
    at org.apache.catalina.core.StandardService.start(StandardService.java:451)
    at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
    at org.apache.catalina.startup.Catalina.start(Catalina.java:552)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
    at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
Caused by: org.apache.lucene.index.CorruptIndexException: Unknown format version: -4
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:204)
    at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:190)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:610)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:185)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:148)
    at org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:84)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:489)
    ... 30 more
*

*description* *The server encountered an internal error (the same "Severe
errors in solr configuration" text and stack trace as in the message field
above) that prevented it from fulfilling this request.*
------------------------------
Apache Tomcat/6.0.7

Can any one help??

Regards,
Pavan
On Fri, Feb 15, 2008 at 8:14 PM, Juho-Matti Stenberg (JIRA) <ji...@apache.org>
wrote:

>
>    [
> https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12569275#action_12569275]
>
> Juho-Matti Stenberg commented on SOLR-284:
> ------------------------------------------
>
> I wrote a simple patch for RichDocumentRequestHandler to accept multivalued
> fields. Just POST the same field name multiple times, e.g.
> category=TVs&category=Radios
>
> {code:title=RichDocumentRequestHandler.java.patch}
> Index: RichDocumentRequestHandler.java
> ===================================================================
> --- RichDocumentRequestHandler.java     (revision 0)
> +++ RichDocumentRequestHandler.java     (working copy)
> @@ -211,7 +211,10 @@
>          for (int i =0; i < fields.length;i++){
>            String fieldName = fields[i].getName();
>
> -           builder.addField(fieldName,params.get(fieldName),1.0f);
> +           String[] values = params.getParams(fieldName);
> +           for(String value : values) {
> +                   builder.addField(fieldName,value,1.0f);
> +           }
>
>          }
> {code}
>
> Seems to work for me.
>
> Best Regards,
> Pompo
>
> > Parsing Rich Document Types
> > ---------------------------
> >
> >                 Key: SOLR-284
> >                 URL: https://issues.apache.org/jira/browse/SOLR-284
> >             Project: Solr
> >          Issue Type: New Feature
> >          Components: update
> >    Affects Versions: 1.3
> >            Reporter: Eric Pugh
> >             Fix For: 1.3
> >
> >         Attachments: libs.zip, rich.patch, source.zip, test-files.zip,
> test.zip
> >
> >
> > I have developed a RichDocumentRequestHandler based on the
> CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or
> PDF document into Solr.
> > There is a wiki page with information here:
> http://wiki.apache.org/solr/UpdateRichDocuments
> >
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
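
The idea behind the multivalued-fields patch quoted above can be shown without Solr on the classpath. The sketch below is a stand-alone illustration, not the handler's code: the class and its helper methods are hypothetical, and the query-string parser skips URL-decoding. It contrasts a single-valued lookup, which keeps only the first value, with a lookup that returns every value posted under the same name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValuedParams {

    /** Parses "a=1&a=2&b=3" into a map that keeps every value for a repeated name. */
    public static Map<String, List<String>> parse(String query) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip malformed pairs; no URL-decoding in this sketch
            }
            String name = pair.substring(0, eq);
            String value = pair.substring(eq + 1);
            params.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return params;
    }

    /** Single-valued lookup, analogous to the line the patch removes: first value only. */
    public static String first(Map<String, List<String>> params, String name) {
        List<String> values = params.get(name);
        return values == null ? null : values.get(0);
    }

    /** Multi-valued lookup, analogous to the patched loop; returns an empty list, never null. */
    public static List<String> all(Map<String, List<String>> params, String name) {
        List<String> values = params.get(name);
        return values == null ? new ArrayList<>() : values;
    }

    public static void main(String[] args) {
        Map<String, List<String>> params = parse("category=TVs&category=Radios");
        System.out.println(first(params, "category")); // prints "TVs"
        System.out.println(all(params, "category"));   // prints "[TVs, Radios]"
    }
}
```

One thing worth checking in the real patch: if the lookup can return null for a field that was not posted, the for-each loop would need a null guard like the one in all() here.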

[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment: SOLR-282.patch

Separated out ID generation to make it easier to override.

I think this is pretty close to being ready to commit, so please review.  I'm wrapped up next week, so I probably won't commit until the end of next week (after 11/21) so please review and provide feedback.  Also, Tika is about to release 0.2, so I may just wait to add that in.

Added in NOTICE and LICENSE information.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-282.patch, SOLR-282.patch, SOLR-282.patch, SOLR-282.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627882#action_12627882 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

A couple of Tika things:

I glanced at Tika yesterday, and it looks like switching this patch over to it wouldn't be too hard. (The only thing half-worthy of note is that org.apache.tika.parser.Parser.parse outputs XHTML [via a SAX interface], which we would probably then need to turn into plaintext.) I haven't yet looked into Eric's code to see if it does anything special that Tika doesn't do.
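
To make the XHTML-to-plaintext step concrete, here is a minimal sketch that uses only the JDK's SAX classes, so it runs without Tika. The class names are hypothetical, and the hard-coded XHTML string merely stands in for the SAX events a parser would emit; the real patch may well handle this differently.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XhtmlToText {

    /** Collects character data and ignores the markup itself. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            text.append(' '); // keep words from adjacent elements apart
        }

        String text() {
            return text.toString().trim().replaceAll("\\s+", " ");
        }
    }

    /** Parses well-formed XHTML and returns its text content with whitespace collapsed. */
    public static String toPlainText(String xhtml) {
        TextCollector collector = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
        return collector.text();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello</p><p>rich documents</p></body></html>";
        System.out.println(toPlainText(xhtml)); // prints "Hello rich documents"
    }
}
```

Appending a space at every closing tag is a blunt choice: it separates paragraphs correctly but would also split a word that is only partly wrapped in an inline element.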

I also noticed something else, though. Earlier comments say that Nutch uses Tika, but when I looked through Nutch trunk this seemed to only sort of be the case. In particular, Nutch definitely uses the stuff in the org.apache.tika.mime namespace, to do things like auto-detect content types, but it doesn't seem to use the stuff in org.apache.tika.parser to do the actual document parsing; instead, it uses its own separate org.apache.nutch.parse.Parser class (and subclasses thereof). For example, org.apache.nutch.parse.html.HtmlParser does not delegate to org.apache.tika.parser.html.HtmlParser but rather does its own direct manipulation of the tagsoup and/or nekohtml libraries. (Things are similar with the Nutch PDF parser.) Nor does there seem to be an alternative class along the lines of org.apache.nutch.parse.TikaBasedParserThatCanParseLotsOfDifferentContentTypesIncludingHtml. And the string "org.apache.tika.parser" doesn't seem to occur in the Nutch source.

I'm wondering if anyone knows why Nutch does not seem to make use of all of Tika's functionality. Are they planning to switch everything over to Tika eventually?


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment:     (was: rich.patch)

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726123#action_12726123 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

{quote}
bq. My only request is that, if you're changing how field mapping works and maybe removing ext.ignore.und.fl, you make sure it stays easy to say, "Tika, I don't care about any of your parsed metadata."

Map unknown fields to an ignored fieldtype.
uprefix=ignored_
{quote}

That seems fine.
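
For readers following along, the quoted suggestion amounts to something like the sketch below. It is a guess at the behaviour rather than the handler's implementation: the class, the method, and the sample schema field names are hypothetical, and only the uprefix=ignored_ value comes from the quote. Fields the schema already knows pass through unchanged, and everything else gets the prefix so that it can land on an ignored field type.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class UnknownFieldMapper {

    /** Returns the name to index under: known fields pass through, others get the prefix. */
    public static String mapField(String name, Set<String> schemaFields, String uprefix) {
        if (uprefix == null || schemaFields.contains(name)) {
            return name;
        }
        return uprefix + name;
    }

    public static void main(String[] args) {
        // Hypothetical schema: only these three fields are declared explicitly.
        Set<String> schemaFields = new HashSet<>(Arrays.asList("id", "title", "text"));

        System.out.println(mapField("title", schemaFields, "ignored_"));       // prints "title"
        System.out.println(mapField("Last-Author", schemaFields, "ignored_")); // prints "ignored_Last-Author"
    }
}
```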

Tangentially, I wonder how fast Tika's metadata extraction is compared to its main body text extraction. If the latter doesn't dwarf the former, there might be value in adding a "Solr, don't even ask Tika to calculate metadata at all; just have it extract the body text" flag; this could potentially speed things up for people who don't need the metadata. Maybe it would make sense to benchmark things before adding such a flag, though. I also don't have a good sense of how many people will want to use the metadata feature versus how many won't.


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: rich.patch

Replacing rich.patch. The new one:

1) Rolls together into one handy package all of these:

  * the old rich.patch
  * the contents of source.zip and test.zip
  * Pompo's multivalued fields patch.

Note: It does *not* include the contents of libs.zip or test-files.zip. I'm not sure what the protocol is around those larger files.

Note: The old rich.patch included a change to Config.java that searched for an alternative config file in "src/test/test-files/solr/conf/". I've removed that change because I think it's debugging code that we don't want in an official patch. Let me know if I'm wrong, though.

2) Makes things work against the latest revision in trunk, r646483. (It had stopped working with the latest version.)

I haven't added any new test cases, but the old ones all pass.

I grant my modifications to ASF according to the Apache License. Someone might want to check that the underlying contributions have been appropriately licensed as well.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654063#action_12654063 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

Committed revision 723977.

Committed Chris' patch, w/ the modification that I put the ext prefix on the resource.name and stream.type.

I also added an ext.metadata.prefix option, which can be used to map the Tika metadata to a dynamic Field, as Erik described above.

See the Wiki page for details: http://wiki.apache.org/solr/ExtractingRequestHandler

Thanks for everyone's input and work!

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650353#action_12650353 ] 

Hoss Man commented on SOLR-284:
-------------------------------

bq. if Tika returns a metadata field and you haven't made an explicit mapping from the Tika fieldname to your Solr fieldname, then Solr will throw an exception and your document add will fail. This doesn't sound very robust for a production environment, unless Tika will only ever use a finite list of metadata field names.

I'm not familiar with the state of the patch, but i'm assuming that (by default) all of the metadata fields produced by tika have a common naming convention -- either in terms of a common prefix or a common suffix.  in which case people can always make a dynamicField declaration to ignore all metadata fields not already explicitly declared.
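To illustrate the idea, here is a minimal Python sketch of how a catch-all dynamicField could absorb undeclared metadata fields. It is not Solr's schema code: the `tika_` prefix, the field names, and the `resolve_field` helper are all hypothetical. The only thing assumed about Solr is that a dynamicField name is a glob with a single leading or trailing asterisk.

```python
from typing import Optional

# Hypothetical schema: explicitly declared fields plus one catch-all pattern.
DECLARED_FIELDS = {"id", "title", "tika_author"}
DYNAMIC_PATTERNS = ["tika_*"]  # assumes Tika metadata shares a common prefix


def matches_dynamic(name: str, pattern: str) -> bool:
    """True if name matches a glob with one leading or trailing '*'."""
    if pattern.startswith("*"):
        return name.endswith(pattern[1:])
    if pattern.endswith("*"):
        return name.startswith(pattern[:-1])
    return name == pattern


def resolve_field(name: str) -> Optional[str]:
    """Return the schema entry that accepts this field, or None if the add would fail."""
    if name in DECLARED_FIELDS:
        return name
    for pattern in DYNAMIC_PATTERNS:
        if matches_dynamic(name, pattern):
            return pattern
    return None
```

With this setup `resolve_field("tika_producer")` falls through to the `tika_*` pattern, while a field with no common prefix, such as `producer`, is still rejected. That is why the approach depends on the metadata fields following a naming convention.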

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment: test.zip

test code, this time with granted license!

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662696#action_12662696 ] 

Hoss Man commented on SOLR-284:
-------------------------------

bq. I put in the code b/c I figured it was better to generate an ID than to outright reject the document,

Hmmm ... that means that if i have a schema with a uniqueKey field, and i forget to specify a uniqueKey value when indexing my document, the handler will "silently succeed" in adding a document with a key i have no control over instead of failing in a way that will make me aware of my mistake -- and i have no way of configuring solr to prevent that kind of "silent success"

If i wanted that behavior, i could configure the schema with a UUIDField as the uniqueKey and take advantage of the default.  but as it is now i have no way to prevent it.

I would think consistency and flexibility are more important, and i would remove that "generateId" functionality along the lines of Lance's suggestion.
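The trade-off can be sketched in a few lines of Python. This is only an illustration of "fail loudly" versus "silently generate", not the handler's code: `prepare_document`, the `generate_id` switch, and the `id` field name are hypothetical.

```python
import uuid

UNIQUE_KEY = "id"  # hypothetical uniqueKey field name


def prepare_document(fields: dict, generate_id: bool = False) -> dict:
    """Return a copy of fields that is guaranteed to carry a uniqueKey value."""
    doc = dict(fields)
    if not doc.get(UNIQUE_KEY):
        if not generate_id:
            # Fail fast so the caller notices the missing key.
            raise ValueError(f"missing required uniqueKey field '{UNIQUE_KEY}'")
        # Opt-in only: the caller explicitly asked for a generated key.
        doc[UNIQUE_KEY] = str(uuid.uuid4())
    return doc
```

Making generation opt-in keeps the default consistent with other update handlers, and anyone who wants generated keys still has a way to get them.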

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646347#action_12646347 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

FYI, I intend to integrate Tika now that it has graduated from incubation and is a full-fledged Lucene sub-project.  I will do my best to be back-compatible with this patch, but make no guarantees as of now, since I have not reviewed this patch in a long time.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment:     (was: test.zip)

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment: test-files.zip

test files to go in test/test-files for unit testing.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-284:
------------------------------

    Attachment: SOLR-284.patch

OK, here's my first crack at cleaning things up a little before release.  Changes:
- there were no tests for XML attribute indexing.
- capture had no unit tests
- boost had no unit tests
- ignoring unknown fields had no unit test
- metadata prefix had no unit test
- logging ignored fields at the INFO level for each document loaded is too verbose
- removed handling of undeclared fields and let downstream components
  handle this.
- avoid the String concatenation code for single valued fields when Tika only
  produces a single value (for performance)
- remove multiple literal detection handling for single valued fields - let a downstream component handle it
- map literal values just as one would with generated metadata, since the user may be just supplying the extra metadata.  also apply transforms (date formatting currently)
- fixed a bug where null field values were being added (and later dropped by Solr... hence it was never caught).
- avoid catching previously thrown SolrExceptions... let them fly through
- removed some unused code (id generation, etc)
- added lowernames option to map field names to lowercase/underscores
- switched builderStack from synchronized Stack to LinkedList 
- fixed a bug that caused content to be appended with no whitespace in between
- made extracting request handler lazy loading in example config
- added ignored_ and attr_ dynamic fields in example schema

Interface:
{code}
// The default field is always "content"; use map to change it to something else
lowernames=true/false             // if true, map names like Content-Type to content_type
map.<fname>=<target_field>
boost.<fname>=<boost>
literal.<fname>=<literal_value>
xpath=<xpath_expr>                // only generate content for the matching xpath expr
extractOnly=true/false            // if true, just return the extracted content
capture=<xml_element_name>        // separate out these elements
captureAttr=<xml_element_name>    // separate out the attributes for these elements
uprefix=<prefix>                  // unknown field prefix: any unknown fields will be prefixed with this value
stream.type
resource.name

To make things more uniform, all fields, whether "content", metadata, attributes, or literals, go through the same process:
1) map to lowercase if lowernames=true
2) apply map.field rules
3) if the resulting field is unknown, prefix it with uprefix
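
The three steps above can be sketched in Python as follows. This is a rough model, not the handler's Java code: `resolve_name` and its arguments are hypothetical, and the exact lowernames rule (lowercase, then replace every character that is not a letter or digit with an underscore) is an assumption based on the Content-Type example.

```python
import re


def resolve_name(name, lowernames=False, field_map=None, known_fields=(), uprefix=None):
    """Map an extracted field name to the Solr field it would be indexed under."""
    # 1) map to lowercase/underscores if lowernames=true (assumed rule)
    if lowernames:
        name = re.sub(r"[^a-z0-9]", "_", name.lower())
    # 2) apply map.<fname>=<target_field> rules
    name = (field_map or {}).get(name, name)
    # 3) if the resulting field is unknown, prefix it with uprefix
    if uprefix is not None and name not in known_fields:
        name = uprefix + name
    return name
```

For example, `resolve_name("X-Parsed-By", lowernames=True, known_fields={"author"}, uprefix="ignored_")` gives `ignored_x_parsed_by`, which an `ignored_*` dynamic field could then swallow. In a real schema, "known" would also have to account for dynamic fields, which this sketch leaves out.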

Hopefully people will agree that this is an improvement in general.  I think in the future we'll need more advanced options, esp around dealing with links in HTML and more powerful xpath constructs, but that's for after 1.4 IMO.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627333#action_12627333 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

While we're on the subject of breaking changes, I'm now seeing some merit in replacing the fieldnames parameter with a field-specifying prefix.

Currently when you want to set a non-body field, you introduce the field name in the fieldnames parameter and then specify its value in another parameter, like so:

   /update/rich/...fieldnames=f1,f2,f3&f1=val1&f2=val2&f3=val3

The alternative would be to signal the fields f1, f2, and f3 by a field prefix, like so:

  /update/rich/...f.f1=val1&f.f2=val2&f.f3=val3

Because the f prefix says "this is a field", there's no need for the fieldnames parameter.
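
To make the difference concrete, here is a small Python sketch of how a handler might read fields under each scheme. It is not the handler's code: both function names are hypothetical, and the query strings are the ones used in this comment.

```python
from urllib.parse import parse_qsl


def fields_from_fieldnames(query: str) -> dict:
    """Current style: fieldnames=f1,f2 lists the fields; values come from same-named params."""
    params = dict(parse_qsl(query))
    names = [n for n in params.get("fieldnames", "").split(",") if n]
    # A name listed in fieldnames but missing as a parameter yields None.
    return {name: params.get(name) for name in names}


def fields_from_prefix(query: str, prefix: str = "f.") -> dict:
    """Proposed style: every parameter starting with the prefix is a field."""
    params = dict(parse_qsl(query))
    return {key[len(prefix):]: value for key, value in params.items() if key.startswith(prefix)}
```

Both styles give the same result for a well-formed request. With `fieldnames=f1,f2,g3&f1=val1&f2=val2&f3=val3`, the first function returns `None` for `g3` and drops `f3` entirely. With the prefix style the name and the value travel in a single parameter, so that mismatch cannot occur.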

This isn't an Earth-shattering improvement, but there are three things I like about it:

1. The URLs are shorter

2. If you rename a field (e.g. rename f3 to g3), you can't accidentally half-update the URL in the client code, like this:

  /update/rich/...fieldnames=f1,f2,g3&f1=val1&f2=val2&f3=val3

3. Currently there are certain reserved words (e.g. "fieldnames", "commit") that you can't use, because they have special meaning to the handler. But with this change they become legitimate field names. For example, maybe I want each of my documents to have a "commit" field that describes who made the most recent relevant commit in a version control system.

  /update/rich/...commit=true&f.commit=chris

I can't think of any downsides right now, other than breaking people's code. (I do admit that is a downside.)

Any comments?


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment:     (was: rich.patch)

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724938#action_12724938 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

I will review your comments more tomorrow.  Still waist deep in boxes from the move!

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Rogério Pereira Araújo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614992#action_12614992 ] 

Rogério Pereira Araújo commented on SOLR-284:
---------------------------------------------

Who is working on the Tika-based handler? Can work on the Tika-based handler be started, or is Tika not mature enough yet?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: SOLR-284.patch

Small change to the 2008-11-26 09:18 AM SOLR-284.patch (my previous one), this time adding an "example" ant target to contrib/javascript/build.xml. (Without this, the top-level "ant example" was failing.)

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756259#action_12756259 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

Grant and company: I just noticed that the example solrconfig.xml at the head of SVN trunk still uses map, not fmap. (In particular, there's "map.content", "map.a", and "map.div".) I assume this should be fixed for the 1.4 release. Interestingly, this doesn't seem to make any unit tests fail.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650634#action_12650634 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

bq. Tika doesn't need to do this explicitly.... you know all fields coming out of your call to the Tika API will be Tika fields. Solr Cell could map all Tika output fields to tika_* where * is the Tika outputted field name. And with field name mapping this default would be overridden, say tika_title mapped to "title".

I can add in an option to have it do this mapping.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656018#action_12656018 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

Forgot a couple of things on this:

1. To hook into the release/javadoc mechanism.
2. In order to facilitate separation of the javadocs and other things, I'm going to move the code to o.a.s.handler.extraction package.
3. Need to publish the Maven artifacts.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment: SOLR-282.patch

Fix issue with literal mapping

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-282.patch, SOLR-282.patch, SOLR-282.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment: rich.patch

Patch file for adding new handler and test cases.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: rich.patch
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647618#action_12647618 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

Question for the people watching this:

Would you prefer that I create a new wiki page and keep the old one for those using Chris/Eric's patch, or would you rather I overwrite/edit the current one?

FWIW, some of the parameters will be the same, but I'm also adding quite a bit more: boosting; XPath expression support (Tika returns everything as XHTML, so it becomes possible to restrict which parts you want to pay attention to); extraction only (i.e. no indexing); support for metadata extraction and indexing; support for sending in "literals", which are like the current fieldnames parameter; and likely some other pieces.
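
To illustrate the XPath idea, here is a minimal sketch in Python, assuming the extracted output is well-formed XHTML. The sample markup, the select_text helper, and the "content" class name are all hypothetical and stand in for whatever Tika actually emits; the standard library's ElementTree only supports a limited XPath subset, so this is not the handler's implementation.

```python
import xml.etree.ElementTree as ET

# Namespace prefix map for XHTML; the prefix "x" is an arbitrary choice.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical stand-in for the XHTML a parser might emit for one document.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Quarterly report</title></head>
  <body>
    <div class="header"><p>Confidential</p></div>
    <div class="content"><p>Revenue grew.</p><p>Costs fell.</p></div>
  </body>
</html>"""


def select_text(xhtml, xpath):
    """Return the text of every element matched by a (limited) XPath expression."""
    root = ET.fromstring(xhtml)
    return ["".join(el.itertext()).strip() for el in root.findall(xpath, XHTML_NS)]


# Keep only the paragraphs inside the "content" div and skip the header.
paragraphs = select_text(SAMPLE, ".//x:div[@class='content']/x:p")
print(paragraphs)
```

Only the matched elements would then be passed on for indexing; everything outside the expression is ignored.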

FYI: Out of the box, Tika has support for: http://incubator.apache.org/tika/formats.html and I know they are adding more things as well, like Flash, etc.

It should also be noted that if you are just indexing metadata about a file, it makes more sense to do the work on the client side.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724862#action_12724862 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

I just tried setting ext.idx.attr=false, and I didn't see any change after indexing a PDF.
Perhaps we don't even need this option if we map attributes to an ignored_ field that is ignored?
In any case, the default seems like it should generate / index attributes.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: rich.patch

Trivial update to merge cleanly against r685275.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment: test.zip

Add the test code for RichDocumentRequestHandler.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594992#action_12594992 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

Why not just use Tika (or Aperture, but its license isn't as friendly)?  It doesn't make sense to reinvent the wheel here.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Klaas updated SOLR-284:
----------------------------

    Fix Version/s:     (was: 1.3)

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: un-hardcode-id.diff

The patch, as it currently stands, treats a field called "id" as a special case. First, it is a required field. Second, unlike any other field, you don't need to declare it in the fieldnames parameter. Finally, since the field is read with SolrParams.getInt(), that field is required to be an int.

This special-case treatment seems a little too particular to me; not everyone wants to have a field called "id", and not everyone who does wants that field to be an int. So what I propose is to eliminate the special treatment of "id". See un-hardcode-id.diff for what this might mean in particular. (That file is not complete; to correctly make this change, I'd have to update the test cases.)

This is a breaking change, because if you *are* using an id field, you'll now have to specifically indicate that fact in the fieldnames parameter. Thus, instead of

http://localhost:8983/solr/update/rich?stream.file=myfile.doc&stream.type=doc&id=100&stream.fieldname=text&fieldnames=subject,author&subject=mysubject&author=eric

you'll have to put

http://localhost:8983/solr/update/rich?stream.file=myfile.doc&stream.type=doc&id=100&stream.fieldname=text&fieldnames=id,subject,author&subject=mysubject&author=eric

I think asking users of this patch to make this slight change in their client code is not an unreasonable burden, but I'm curious what Eric and others have to say.
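
For client code, the change amounts to listing every field, id included, in fieldnames. A minimal sketch in Python of building such a request URL follows; the rich_update_url helper is hypothetical, and the base URL and parameter names are simply taken from the example URLs above.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Base URL copied from the example above; adjust for your own deployment.
BASE = "http://localhost:8983/solr/update/rich"


def rich_update_url(stream_file, stream_type, fields, stream_fieldname="text"):
    """Build an update URL in which every field, including id, is listed in fieldnames."""
    params = {
        "stream.file": stream_file,
        "stream.type": stream_type,
        "stream.fieldname": stream_fieldname,
        # Dicts preserve insertion order, so fieldnames follows the order given.
        "fieldnames": ",".join(fields),
    }
    params.update(fields)
    # Keep commas unescaped so the fieldnames list stays readable.
    return BASE + "?" + urlencode(params, safe=",")


url = rich_update_url(
    "myfile.doc", "doc", {"id": "100", "subject": "mysubject", "author": "eric"}
)
print(url)
```

Because id is passed through the same dictionary as the other fields, nothing in the client treats it specially, which mirrors what the proposed change does on the server side.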

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Description: 
I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.


There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
 

  was:
I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.

I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.

 


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment: SOLR-282.patch

First crack at this.  You'll need to download http://people.apache.org/~gsingers/extraction-libs.tar as it is too big to fit in JIRA.

There's probably lots wrong with it, so be gentle!  See http://wiki.apache.org/solr/ExtractingRequestHandler to get started.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-282.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646947#action_12646947 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

Some initial thoughts on moving forward:

I think we can add some generic functionality here via the request params:

1. Tika can provide a lot of metadata about a document.  By metadata, I mean things like the actual author, pages, etc. as provided by the document, not the hardcoded metadata in the http://wiki.apache.org/solr/UpdateRichDocuments.  The hardcoded metadata is also useful and should be retained.  With these, we then need a way to map fields from Tika's metadata to Solr fields.  If no mapping is specified, it tries to use the Tika metadata name as the field name.  If that doesn't exist, then we can rely on dynamic fields or we can allow for a param that passes in the name of a default field to map to.

2.  We can auto-detect the MIME type or allow for it to be passed in.  Thus, stream.type becomes optional, but is still useful.

3. Tika provides a mechanism for implementing your own SAX ContentHandler and passing that in.  I will likely make this pluggable such that people can provide their own.  I _think_ this would allow people to make even further refinements to the content (e.g. splitting on paragraphs or other things like that?)
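
The fallback order described in point 1 can be sketched roughly as follows, in Python. The resolve_field function, its arguments, and the glob-style dynamic field patterns are assumptions made for illustration only; they are not part of any patch attached to this issue.

```python
from fnmatch import fnmatchcase


def resolve_field(name, mapping, schema_fields, dynamic_patterns=(), default_field=None):
    """Pick the Solr field for one metadata name, following the fallback order in point 1."""
    if name in mapping:
        # 1. An explicit mapping wins.
        return mapping[name]
    if name in schema_fields:
        # 2. Otherwise use the metadata name itself if the schema declares it.
        return name
    if any(fnmatchcase(name, pattern) for pattern in dynamic_patterns):
        # 3. Otherwise rely on a dynamic field whose pattern matches the name.
        return name
    # 4. Otherwise fall back to the caller-supplied default (None means "drop it").
    return default_field


mapping = {"Author": "author"}
schema_fields = {"author", "title", "text"}
dynamic_patterns = ("*_s",)

for metadata_name in ("Author", "title", "page_count_s", "Producer"):
    print(metadata_name, "->",
          resolve_field(metadata_name, mapping, schema_fields, dynamic_patterns, "text"))
```

Returning None when nothing matches and no default is given leaves the decision to drop or reject the value with the caller.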

I should have a start of a patch today or tomorrow.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595372#action_12595372 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

I'm on the fence about whether this patch makes sense to include in Solr right now. One thing I'm wondering, though: Can we assess the odds at this point whether it could make sense for a Tika-based handler to offer the same public interface that the handler in this patch presents? That is, even if the underlying implementation were switched to Tika at some point, could we avoid changing the URL schema and such that Solr clients would use to interact with it?

If it's likely that the public interface could indeed remain the same for the first Tika-based handler release (or at least more or less the same), would this alleviate any of Grant's concerns?

Also, would putting this handler into a contrib directory rather than in the main code base, as has been mentioned on the mailing list, make committing it any less problematic?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment: SOLR-284.patch

Let's name the patch right, eh?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-282.patch, SOLR-282.patch, SOLR-282.patch, SOLR-284.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541879 ] 

Eric Pugh commented on SOLR-284:
--------------------------------

Juri,

Thanks for the vote on the issue!  The next time I update this patch to work with the latest code, I'll apply your change.  Since this is still a pending patch, I am not actively maintaining it.  Thanks for voting for this patch; there is only one other patch with more votes, so hopefully it will be added soon.  I'd love to hear what your use case for this patch is.


https://issues.apache.org/jira/browse/SOLR?report=com.atlassian.jira.plugin.system.project:popularissues-panel

Eric


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Jonathan Hipkiss (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551506 ] 

Jonathan Hipkiss commented on SOLR-284:
---------------------------------------

This is crucial functionality if Solr is to be accepted as a solution in any organisation.  A search engine that can't parse Microsoft or other closed formats is useless to most organisations.
This is a MUST!

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583231#action_12583231 ] 

Eric Pugh commented on SOLR-284:
--------------------------------

Chris,  I like what you are thinking...  Really this is sort of becoming the AllDocumentsUnderTheSunRequestHandler, but what that highlights is that the current solution really doesn't do what we need, which is making it dirt simple to add new handlers...   

While there are some efforts under way to do that, to provide the "uber" solution, I think adding another hack/method to RichDocumentRequestHandler is cool with me.  Since it's just a patch file, feel free to take it, munge it, and post it back as the "current" patch.  If you do, make sure to add to the docs on the wiki at http://wiki.apache.org/solr/UpdateRichDocuments.

Heck, you may want to rip in Pompo's fix as well!

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Resolved: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved SOLR-284.
----------------------------------

    Resolution: Fixed

Committed revision 815293.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment: SOLR-284.patch

Fix an issue w/ XPath and extract only.  See http://tika.markmail.org/message/kknu3hw7argwiqin

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: SOLR-284.patch

This should be the last change for today.

This change adds a resource.name parameter that you can pass to the handler. (I'm guessing you'll probably typically pass a filename, though Tika does use the more general term "resource name".) If you provide it, Tika can take advantage of it when applying its heuristics to determine the MIME type.

Affected files:

 * ExtractingParams.java
 * ExtractingDocumentLoader.java
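
As a rough illustration of why a resource name helps, here is a minimal sketch in Python that uses the standard library's mimetypes module as a stand-in for name-based detection. It does not reproduce Tika's heuristics; the guess_mime helper and its fallback order are assumptions made for this example.

```python
import mimetypes


def guess_mime(resource_name=None, declared_type=None):
    """Choose a MIME type: an explicit type first, then a guess from the resource name."""
    if declared_type:
        # A type supplied by the caller always wins.
        return declared_type
    if resource_name:
        # The file extension in the resource name is used as a hint.
        guessed, _encoding = mimetypes.guess_type(resource_name)
        if guessed:
            return guessed
    # Nothing to go on: fall back to the generic binary type.
    return "application/octet-stream"


print(guess_mime("report.pdf"))
print(guess_mime())
```

Without a name, a detector has only the stream's bytes to work with, which is why passing the name can improve the result.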


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Erik Hatcher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647619#action_12647619 ] 

Erik Hatcher commented on SOLR-284:
-----------------------------------

I'd rather see the old (err, current) wiki page replaced/renamed, and kept current with the latest patch/commit from this issue.  Nice work Grant!

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595007#action_12595007 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

I'm not sure this patch entirely reinvents the wheel, as it does most of the heavy lifting with preexisting components, namely PDFBox, POI, and Solr's own HTMLStripReader. It also has the advantage of already existing, whereas tying Solr to Tika or Aperture would take additional effort.

Tika or Aperture do look really nice, though. The most obvious advantage these projects have over this patch is that they can already extract text from more file formats than this patch, and that the developers will probably continue to add more file formats over time. Are you thinking of additional advantages on top of this, Grant? Do you have any cool ideas about how Tika/Aperture's metadata extraction facilities might be integrated into Solr? Is there a potentially interesting interface between Aperture's crawling facilities and Solr?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment: SOLR-284.patch

Adds in defaultField parameter and tests.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582062#action_12582062 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

I'm thinking it would be handy if RichDocumentRequestHandler could support indexing text and HTML files, in addition to the fancier formats (pdf, doc, etc.). That way I could use RichDocumentRequestHandler for all my indexing needs (except commits and optimizes), rather than use it for some doc types but still have to use XmlUpdateRequestHandler for text and HTML docs. Would anyone else find this useful?

I skimmed the source, and adding support for text files looks trivial. (It's just a pass-through.) And if you had this, then I guess you'd have at least one version of HTML support for free; in particular, you could upload your HTML file to RichDocumentRequestHandler, telling the handler that the document is in plain text format, and then strip off the HTML tags later by using the HTMLStripStandardTokenizer in your schema.xml.

Alternatively, RichDocumentRequestHandler could provide its own explicit HTML to text conversion. There would probably be some advantages to this, but I'm not sure exactly what they would be. One, I guess, would be that you could use tokenizers that didn't make use of HTMLStripReader.
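
To make the pass-through idea concrete, here is a minimal sketch of dispatching on a stream type. It is not code from the patch: the class name, the type strings ("text", "html"), and the regex-based tag stripper are all assumptions for illustration, and the regex is a naive stand-in rather than Solr's HTMLStripReader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names come from the SOLR-284 patch.
class StreamTypeDispatch {
    // Registry from stream type to a function that turns raw content into indexable text.
    static final Map<String, UnaryOperator<String>> EXTRACTORS = new HashMap<>();

    static {
        // Plain text needs no conversion, so the extractor is a pass-through.
        EXTRACTORS.put("text", raw -> raw);
        // Explicit HTML-to-text conversion inside the handler (naive tag removal).
        EXTRACTORS.put("html", StreamTypeDispatch::stripTags);
    }

    // Replaces every tag with a space, then collapses runs of whitespace.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Looks up the extractor for the given type and applies it to the content.
    static String extract(String streamType, String raw) {
        UnaryOperator<String> extractor = EXTRACTORS.get(streamType);
        if (extractor == null) {
            throw new IllegalArgumentException("Unsupported stream type: " + streamType);
        }
        return extractor.apply(raw);
    }
}
```

Sending an HTML file with the "text" type leaves the tags in place, which matches the first option above (strip them later in the analysis chain); the "html" type corresponds to the second option, where the handler does the conversion itself.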

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525750 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

In regards to Tika not having any code, you may also find http://aperture.sourceforge.net does many of the same things for handling different file formats, etc.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Work started: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on SOLR-284 started by Grant Ingersoll.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662793#action_12662793 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

bq.  Hmmm ... that means that if i have a schema with a uniqueKey field, and i forget to specify a uniqueKey value when indexing my document, the handler will "silently succeed" in adding a document with a key i have no control over instead of failing in a way that will make me aware of my mistake - and i have no way of configuring solr to prevent that kind of "silent success"

Actually, there is a mechanism for avoiding it, and it is documented at http://wiki.apache.org/solr/ExtractingRequestHandler#head-6cda7b8832bb2ccaf6b0b57a6ef524b553db489e

I could, however, see adding a flag to specify whether one wants "silent success" or not.  I think the use case for content extraction is different than the normal XML message path.  Oftentimes, these files are quite large and the cost of sending them to the system is significant.

Another thing that might be interesting to do is to actually return the generated id in the response.
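
One way such a flag could behave is sketched below. This is an assumption about a possible design, not the handler's actual code: resolveId and allowGenerated are hypothetical names, and a random UUID stands in for whatever key the handler really generates.

```java
import java.util.UUID;

// Hypothetical sketch of a "silent success" switch for missing unique keys.
class KeyPolicy {
    // Returns the caller's id when present. Otherwise either generates one
    // (the "silent success" path) or fails, depending on allowGenerated.
    static String resolveId(String suppliedId, boolean allowGenerated) {
        if (suppliedId != null && !suppliedId.isEmpty()) {
            return suppliedId;
        }
        if (!allowGenerated) {
            throw new IllegalArgumentException("Document is missing a uniqueKey value");
        }
        // The returned value is what could be echoed back in the response.
        return UUID.randomUUID().toString();
    }
}
```

Because the method returns the id it settled on, the caller could put that value into the response, so a client that did not supply a key can still learn which one was assigned.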


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724937#action_12724937 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

bq. I just tried setting ext.idx.attr=false, and I didn't see any change after indexing a PDF.

This is often needed for HTML, where it is used to index the attributes of tags.  Same would go for XML.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754646#action_12754646 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

bq. What's the real use-case, to be able to search all metadata? One could use a dynamic copyField into a single indexed field. That also helps if one still wants to keep all of the stored values for the metadata in separate fields.

Yeah, that works too, but it is convoluted and I may not care about storing the attributes nor want to deal with copyFields and the extra performance costs.  It just seems easier to have a default field capability.  Then one can just have everything go to it.
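
A rough sketch of the default-field idea: values whose names match a declared field keep their own field, and everything else is appended to one catch-all field. The class, method, and field names here are hypothetical and not taken from the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of routing extracted values with a default field.
class DefaultFieldRouting {
    // extracted:    name -> value pairs produced by extraction
    // declared:     field names the schema knows about
    // defaultField: catch-all field for everything else; null means "drop it"
    static Map<String, List<String>> route(Map<String, String> extracted,
                                           Set<String> declared,
                                           String defaultField) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String target = declared.contains(entry.getKey()) ? entry.getKey() : defaultField;
            if (target == null) {
                continue; // no default field configured, so the value is skipped
            }
            doc.computeIfAbsent(target, k -> new ArrayList<>()).add(entry.getValue());
        }
        return doc;
    }
}
```

Compared with the copyField approach quoted above, nothing extra is stored per metadata field here: undeclared values only ever land in the single default field.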

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725355#action_12725355 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

bq. ext.ignore.und.fl

I think this should be kept and this is a case where we should silently ignore.  Parsing rich data is a different beast than normal Solr XML or other structured content.  There are a lot of times where you only want to get specific fields and there can be a large number of fields.  It is burdensome to have to add the ignores for all the metadata.  Not to mention different types may have different metadata.  So, -1 on removing.

bq. ext.idx.attr

Yes, we may want it to be false.  That's why I put it in!  :-)  It can be used to extract things like HREF into other fields or not.  Think faceting.

bq. ext.metadata.prefix

This is not a mapping thing so much as a way to separately handle metadata fields from the main text fields.  I'm not sure if it differs from the uprefix approach you are proposing except you can know exactly what is metadata and what isn't.


Other questions that Yonik brought up:

1. I don't think trying to auto map is a good idea.  New file formats will have new ways of doing them, it's better to have the user handle it.  
2. Fine with dropping ext for common names
3. Metadata is often not useful and I don't think we need to do work as suggested.  See Eric's comment above.
4. Enabling by default is fine.
 

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650359#action_12650359 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

{quote}
I'm not familiar with the state of the patch, but i'm assuming that (by default) all of the metadata fields produced by tika have a common naming convention - either in terms of a common prefix or a common suffix. in which case people can always make a dynamicField declaration to ignore all metadata fields not already explicitly declared.
{quote}

No, they don't, but that is a good idea for Tika.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649986#action_12649986 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

{quote}
I think I like where this is going.
{quote}

Great!  I think the nice thing is as Tika grows, we'll get many more formats all for free.  For instance, I saw someone working on a Flash extractor.

{quote}
Currently the default is ext.ignore.und.fl (IGNORE_UNDECLARED_FIELDS) == false, which means that if Tika returns a metadata field and you haven't made an explicit mapping from the Tika fieldname to your Solr fieldname, then Solr will throw an exception and your document add will fail. This doesn't sound very robust for a production environment, unless Tika will only ever use a finite list of metadata field names. (That doesn't sound plausible, though I admit I haven't looked into it.) Even in that case, I think I'd rather not have to set up a mapping for every possible field name in order to get started with this handler. Would true perhaps be a better default?
{quote}

I guess I was thinking that most people will probably start out with this by sending their docs through the engine and seeing what happens.  I think an exception helps them see sooner what they are missing.  That being said, I don't feel particularly strongly about it.  It's easy enough to set it to true in the request handler mappings.  From what I see of Tika, though, the possible values for metadata are fixed within a version.  Perhaps the bigger issue is what happens when someone updates Tika to a newer version with newer Metadata options.
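
The trade-off between failing fast and silently ignoring can be sketched as follows. The names are hypothetical; only the behavior (throw on an unmapped metadata field unless an ignore flag is set) follows the description of ext.ignore.und.fl above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the two behaviors discussed for undeclared fields.
class UndeclaredFieldPolicy {
    // metadata: extracted metadata name -> value
    // fieldMap: explicit mapping from metadata name to Solr field name
    // ignoreUndeclared: false = fail on the first unmapped name, true = skip it
    static Map<String, String> mapMetadata(Map<String, String> metadata,
                                           Map<String, String> fieldMap,
                                           boolean ignoreUndeclared) {
        Map<String, String> solrFields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String target = fieldMap.get(entry.getKey());
            if (target == null) {
                if (ignoreUndeclared) {
                    continue; // silently drop the unmapped metadata field
                }
                throw new IllegalArgumentException(
                        "No mapping for metadata field: " + entry.getKey());
            }
            solrFields.put(target, entry.getValue());
        }
        return solrFields;
    }
}
```

With the flag off, a metadata name that first appears after a Tika upgrade would make adds fail until a mapping is supplied; with it on, the same name would simply be skipped.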

{quote}
ext.capture / CAPTURE_FIELDS: Do you have a use case in mind for this feature, Grant? The example in the patch is of routing text from <div> tags to one Solr field while routing text from other tags to a different Solr field. I'm kind of curious when this would be useful, especially keeping in mind that, in general, Tika source documents are not HTML, and so when <div> tags are generated they're as much artifacts of Tika as reflecting anything in the underlying document. (You could maybe ask a similar question about ext.idx.attr / INDEX_ATTRIBUTES.)
{quote}

For capture fields, it's similar to a copy field function.  Say, for example, you want a whole document in one field, but also to be able to search within paragraphs.  Then, you could use a capture field on a <p> tag to do that.  Thus, you get the best of both worlds.  The Tika output is XHTML.

Also, since extraction is happening on the server side, I want to make sure we have lots of options for dealing with the content.  I don't know where else one would have options to muck with the content post-extraction, but pre-indexing.  Hooking into the processor chain is too late, since then the Tika structure is gone.  That's my reasoning, anyway.  

Similarly for index attributes: when Tika extracts from an HTML file and comes across anchor tags (<a>), it provides the attributes of the tags as XML attributes.  So, one may want to extract the links separately from the main content and put them into a separate field.
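
As an illustration of capture fields and index attributes, the sketch below walks XHTML with the JDK's SAX parser, keeps all text for a whole-document field, captures each <p> separately, and collects href attributes from <a> tags. It is meant for a hand-written XHTML string rather than real Tika output, and the class and field names are assumptions, not the handler's implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: captures <p> text and <a href> values from XHTML.
class CaptureSketch extends DefaultHandler {
    final StringBuilder allText = new StringBuilder(); // whole-document field
    final List<String> paragraphs = new ArrayList<>(); // one entry per captured <p>
    final List<String> links = new ArrayList<>();      // href attribute values
    private StringBuilder currentP;                    // non-null only while inside a <p>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) {
            currentP = new StringBuilder();
        }
        if ("a".equals(qName) && atts.getValue("href") != null) {
            links.add(atts.getValue("href"));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        allText.append(ch, start, length);       // everything goes to the main field
        if (currentP != null) {
            currentP.append(ch, start, length);  // and also to the open paragraph
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName) && currentP != null) {
            paragraphs.add(currentP.toString().trim());
            currentP = null;
        }
    }

    // Parses the given XHTML string and returns the populated handler.
    static CaptureSketch parse(String xhtml) {
        CaptureSketch handler = new CaptureSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return handler;
    }
}
```

The capture has to happen while the element structure is still visible, which is the point made above about the processor chain being too late: once the text has been flattened into field values, the <p> and <a> boundaries are gone.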

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Juri Kuehn (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541535 ] 

Juri Kuehn commented on SOLR-284:
---------------------------------

Hi Eric, thank you for this handler, works like a charm!
I need to use non-numeric ids, which are fine with Solr but are rejected by RichDocumentRequestHandler. I'm not familiar with the Solr code, but I patched RichDocumentRequestHandler.java so that it does not convert the id to an int, which hasn't caused trouble so far:

{code:title=RichDocumentRequestHandler.java.patch}
Index: RichDocumentRequestHandler.java
===================================================================
--- RichDocumentRequestHandler.java	(revision 0)
+++ RichDocumentRequestHandler.java	(working copy)
@@ -133,7 +133,7 @@
 	String streamFieldname;
 	String[] fieldnames;
 	SchemaField[] fields;
-	int id;
+	String id;
 	  
 	final AddUpdateCommand templateAdd;
 
@@ -153,7 +153,7 @@
 	    String fn = params.get(FIELDNAMES);
 	    fieldnames = fn != null ? commaSplit.split(fn,-1) : null;
 	    
-	    id = params.getInt(ID);
+	    id = params.get(ID);
 
 		templateAdd = new AddUpdateCommand();
 		templateAdd.allowDups = false;
@@ -202,7 +202,7 @@
 	 * @param desc
 	 *            TODO
 	 */
-	void doAdd(int id, String text, DocumentBuilder builder, AddUpdateCommand template)
+	void doAdd(String id, String text, DocumentBuilder builder, AddUpdateCommand template)
 	throws IOException {
 
 	  // first, create the lucene document
@@ -225,7 +225,7 @@
 	  handler.addDoc(template);
 	}
 
-	void addDoc(int id, String text) throws IOException {
+	void addDoc(String id, String text) throws IOException {
 		templateAdd.indexedId = null;
 		doAdd(id, text, builder, templateAdd);
 	}
{code}

Tests were ok, maybe you can apply it to your sources.

Best regards,
Juri

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652993#action_12652993 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

Currently this patch deploys the Tika libs to /trunk/example/solr/lib. I'm curious where the Tika handler's lib/ directory is supposed to go in a multicore deployment. I created my own multicore setup more or less like this:

* ant example
* Copy /trunk/example to /trunk/solr-10000
* Copy /trunk/solr-10000/multicore/* to /trunk/solr-10000/solr.

(Solr-10000 means "copy of Solr I plan to run on port 10000.")

This seems to be the easiest way to set things up so that I can cd to /trunk/solr-10000 and run start.jar to get multicore Solr running.

Or rather, that *would* get multicore Solr running, except that Solr gets a can't-find-the-Tika-classes exception. So I guess /trunk/solr-10000/solr/lib is not where the lib directory goes for multicore deployment.

So I tried putting Tika libs instead in /trunk/solr-10000/solr/core0/lib, and that loaded fine. That doesn't seem like the right place for the directory, though; it seems like each core shouldn't have to have its own separate copy of the Tika libs.

So where *do* the Tika libs go?


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment: rich.patch

Updated patch file, properly handling missing stream.types, and cleaning up error messages a bit.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  

[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment: rich.patch

Updated to SVN revision 555996

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  

[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595015#action_12595015 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------




I think Tika will actually take less effort, as you only need one  
interface, as I understand it.  You don't need separate handlers for  
each type, we just need to write the interface between Solr and Tika.

Nutch is already using Tika.


+1


Yes, someone else maintains the code.  We just maintain the interface  
and upgrade when appropriate.


well, metadata makes for nice fields to sort, filter and facet on,  
right?


I think it is more likely that you will see Nutch integration w/ Solr  
(in fact, there is already a patch for it), but yeah, I think it makes  
sense to consider Solr as a sink for any crawler.

Some of this also overlaps w/ the Data Import Request Handler on  
SOLR-469.   I don't think we want to get Solr into the crawling game,  
but we also shouldn't prevent it from playing nicely with crawlers  
(not saying it doesn't already)


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509676 ] 

Ryan McKinley commented on SOLR-284:
------------------------------------

I haven't run this patch, but have a few questions...

What is the *general* approach to extracting a Lucene document (list of fields) from a PDF? Word? Powerpoint?

Is this just access to a few common fields like author, keywords, text, etc.?  Is this something that realistically would need to be custom for each case?

Perhaps it makes sense to add a contrib section for this sort of stuff.  It seems weird to add 10 library dependencies to the core distribution.  How does Nutch handle this?
 


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647003#action_12647003 ] 

Eric Pugh commented on SOLR-284:
--------------------------------

Grant,  I am really excited that you are looking at this patch!  

While I am proud of it, and very proud of the number of organizations that have used it and the people who have improved it (thanks, Chris!), it was just written to scratch an itch, so feel free to rip it apart to come up with a better solution for Solr.  I think the key aspect is the ability for Solr to ingest more formats, not how this patch works.




> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: SOLR-284.patch

Changes since my previous upload:

* sync CHANGES.txt with trunk
* test cases for adding plain text data
* you aren't forced to map a field if you use the resource.name parameter


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649542#action_12649542 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

Is the latest patch supposed to contain a file "solr-word.pdf"? I don't see one, and my "ant test" is failing along these lines:

        org.apache.solr.common.SolrException: java.io.FileNotFoundException: solr-word.pdf (The system cannot find the file specified)
	at org.apache.solr.handler.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:160)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1313)
	at org.apache.solr.util.TestHarness.queryAndResponse(TestHarness.java:331)
	at org.apache.solr.handler.ExtractingRequestHandlerTest.loadLocal(ExtractingRequestHandlerTest.java:97)
	at org.apache.solr.handler.ExtractingRequestHandlerTest.testExtraction(ExtractingRequestHandlerTest.java:27)


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728144#action_12728144 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

The current ext.metadata.prefix parameter adds the prefix to all attributes, even those that have already been mapped (so last_modified appears instead as attr_last_modified).  It seems like one really wants the prefix added only for those attributes that are not explicitly mapped (or that don't appear in the schema); this is what the proposed "uprefix" (unknown field prefix) would do.
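
To make the difference concrete, here is a minimal sketch of the two behaviours described above. It is not the handler's actual code: the function names, the sample attributes and the schema set are all hypothetical, and the reading of "unknown" as "neither explicitly mapped nor present in the schema" is an assumption based on the comment.

```python
def apply_metadata_prefix(attrs, mapping, prefix):
    """Behaviour described for ext.metadata.prefix: every attribute gets the
    prefix, including ones that were explicitly mapped."""
    return {prefix + mapping.get(name, name): value for name, value in attrs.items()}


def apply_uprefix(attrs, mapping, schema_fields, uprefix):
    """Proposed uprefix behaviour (as understood here): only names that are
    neither mapped nor defined in the schema get the prefix."""
    out = {}
    for name, value in attrs.items():
        if name in mapping:
            out[mapping[name]] = value      # explicitly mapped: keep the mapped name
        elif name in schema_fields:
            out[name] = value               # already a schema field: leave alone
        else:
            out[uprefix + name] = value     # unknown: prefix it
    return out


# Hypothetical sample data, for illustration only.
attrs = {"Last-Modified": "2009-07-07", "title": "Solr", "producer": "pdfTeX"}
mapping = {"Last-Modified": "last_modified"}
schema_fields = {"last_modified", "title"}

print(apply_metadata_prefix(attrs, mapping, "attr_"))
print(apply_uprefix(attrs, mapping, schema_fields, "attr_"))
```

With the first function the mapped field ends up as attr_last_modified, which is the complaint above; with the second, only the unmapped "producer" attribute is prefixed.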

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615057#action_12615057 ] 

Otis Gospodnetic commented on SOLR-284:
---------------------------------------

I don't think anyone is working on it (publicly), so you are welcome to contribute it.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment: solr-word.pdf

Here's the solr-word PDF.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726116#action_12726116 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

bq. It is burdensome to have to add the ignores for all the metadata.

It would be easy to change the default from index to ignore:
uprefix=ignored_    // ignored_ will be defined in the schema as indexed=false, stored=false
uprefix=attr_

Actually, that brings up another random question: when we get the metadata back from Tika, is it typed (for example, can we tell that the number of pages is an integer)?


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment: SOLR-284-no-key-gen.patch

Remove Key Generation.  Will commit shortly

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650365#action_12650365 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

{quote}
The 2008-11-15 01:12 PM version of SOLR-284.patch contains modifications to client/java/solrj/src/org/apache/solr/client/solrj/util/ClientUtils.java related to date handling. That's not intentional, is it?
{quote}

Yes, it is intentional.  The user will need to be able to pass in/configure their own Date formats for their documents and the implementation has to be able to map those to Solr's canonical date format.  Thus, I moved the date handling stuff to a "common" DateUtils class (and deprecated it in ClientUtils) because it is needed on the server side too.  Unfortunately, it looks like I did some reformatting on the class as a whole, too.  Sorry 'bout that.
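
As a rough illustration of the mapping step described above, the sketch below tries a list of input patterns and emits a UTC timestamp in the ISO 8601 form that Solr date fields use (for example 1995-12-31T23:59:59Z). It is not the DateUtils code from the patch: the function name and the pattern list are made up, and treating values without a zone as UTC is an assumption.

```python
from datetime import datetime, timezone

# Hypothetical input patterns; a real setup would configure its own list.
DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%SZ",   # already canonical
    "%Y-%m-%dT%H:%M:%S%z",  # ISO 8601 with a numeric offset
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
]


def to_solr_date(value, formats=DATE_FORMATS):
    """Return value as yyyy-MM-ddTHH:mm:ssZ in UTC, trying each format in order."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # this pattern did not match, try the next one
        if parsed.tzinfo is None:
            # Assumption: values without a zone are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError(f"unparseable date: {value!r}")


print(to_solr_date("2008-11-15 13:12:00"))
print(to_solr_date("2008-11-15T13:12:00+0100"))
```

Keeping the pattern list as a parameter is what lets the same helper serve both the client and the server side, which is the motivation given above for moving it to a common class.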

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment:     (was: SOLR-282.patch)

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-282.patch, SOLR-284.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment:     (was: SOLR-282.patch)

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-282.patch, SOLR-282.patch, SOLR-284.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated SOLR-284:
----------------------------------

    Fix Version/s: 1.4

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Klaas updated SOLR-284:
----------------------------

    Affects Version/s:     (was: 1.3)

Removing from 1.3.  No committer has taken ownership.

(It might make sense as a contrib, but I can see the argument for not duplicating Tika)

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724855#action_12724855 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

Not sure if I should open a new issue or keep improvements here.
I think we need to improve the OOTB experience with this...
http://search.lucidimagination.com/search/document/302440b8a2451908/solr_cell

Ideas for improvement:
- auto-mapping names of the form Last-Modified to a more solrish field name like last_modified
- drop "ext." from parameter names, and revisit naming to try and unify with other update handlers like CSV
  note: in the future, one could see generic functionality like boosting fields, setting field value defaults, etc, being handled by a generic component or update processor... all the better reason to drop the ext prefix.
- I imagine that metadata is normally useful, so we should:
  1. predefine commonly used metadata fields in the example schema... there's really no cost to this
  2. use mappings to normalize any metadata names (if such normalization isn't already done in Tika)
  3. ignore or drop fields that have little use
  4. provide a way to handle new attributes w/o dropping them or throwing an error
- enable the handler by default - lazy to avoid a dependency on having all the tika libs available
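
The first idea in the list, auto-mapping names such as Last-Modified to last_modified, could look roughly like the sketch below. The function name and the exact rule (lowercase, with runs of non-alphanumeric characters collapsed to a single underscore) are assumptions for illustration, not what Solr or Tika actually implement.

```python
import re


def solrish_name(name):
    """Turn a metadata name like 'Last-Modified' into 'last_modified'."""
    # Collapse each run of non-alphanumeric characters into one underscore,
    # trim underscores from the ends, then lowercase the result.
    return re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_").lower()


for raw in ("Last-Modified", "Content-Type", "Page Count"):
    print(raw, "->", solrish_name(raw))
```

A rule this simple does not split camel-case names, so a real mapping would probably still need an explicit override table for the awkward cases.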


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-284:
------------------------------

    Attachment: schema_update.patch

Attaching a schema update to define some common useful metadata fields to improve the OOTB experience.
Any concerns or suggestions for improvements?  I'd like to commit shortly to get it into 1.4



> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728303#action_12728303 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

The date.format thing is interesting... but shouldn't that really be part of a Date fieldType that can accept all those formats?
Transforming only in the update handler means that you could add a literal.mydate=date1 via the update handler and then fail to query it, because the date parsing was specific to the update handler.

Perhaps we could add this to the new trie field for dates?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726113#action_12726113 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

bq. My only request is that, if you're changing how field mapping works and maybe removing ext.ignore.und.fl, you make sure it stays easy to say, "Tika, I don't care about any of your parsed metadata.

Map unknown fields to an ignored fieldtype.
uprefix=ignored_
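
A toy model of why this works, under stated assumptions: the schema is reduced to a set of known fields plus prefix patterns carrying indexed/stored flags, and a field whose pattern is neither indexed nor stored simply disappears. The names KNOWN_FIELDS, DYNAMIC_FIELDS and kept_fields are invented for this sketch and do not correspond to Solr classes.

```python
from fnmatch import fnmatchcase

# Toy schema. The ignored_* entry mirrors the "indexed=false, stored=false"
# idea from this thread; everything else here is made up.
KNOWN_FIELDS = {"id", "text", "last_modified"}
DYNAMIC_FIELDS = [
    # (pattern, indexed, stored)
    ("ignored_*", False, False),
    ("attr_*", True, True),
]


def kept_fields(doc, uprefix):
    """Return the fields that would survive indexing for a given uprefix."""
    kept = {}
    for name, value in doc.items():
        if name in KNOWN_FIELDS:
            kept[name] = value
            continue
        target = uprefix + name  # unknown field: apply the prefix
        for pattern, indexed, stored in DYNAMIC_FIELDS:
            if fnmatchcase(target, pattern):
                if indexed or stored:
                    kept[target] = value
                break  # first matching pattern decides
    return kept


doc = {"id": "doc1", "text": "hello", "producer": "pdfTeX"}
print(kept_fields(doc, "ignored_"))
print(kept_fields(doc, "attr_"))
```

Switching the single uprefix value is then enough to flip between keeping and discarding all unrecognised metadata, without listing each attribute.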

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: rich.patch

Attaching another patch revision. I've been totally asleep at the wheel today, and my previous one contained not only the feature described in this JIRA issue but also the Data Import RequestHandler patch (SOLR-469). Hopefully I've finally made a patch that's actually correct. I can at least promise that the unit tests pass when applied to r654253.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


Re: [jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

Grant,

 > Attachment: SOLR-282.patch

Isn't it SOLR-284.patch?

Grant Ingersoll (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Grant Ingersoll updated SOLR-284:
> ---------------------------------
>
>     Attachment: SOLR-282.patch
>
> Captured fields weren't being indexed properly.
>
>   
>> Parsing Rich Document Types
>> ---------------------------
>>
>>                 Key: SOLR-284
>>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>>             Project: Solr
>>          Issue Type: New Feature
>>          Components: update
>>            Reporter: Eric Pugh
>>            Assignee: Grant Ingersoll
>>             Fix For: 1.4
>>
>>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-282.patch, SOLR-282.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>>
>>
>> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
>> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>>  
>>     
>
>

[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment: SOLR-282.patch

Captured fields weren't being indexed properly.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-282.patch, SOLR-282.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595365#action_12595365 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

I don't agree on committing it. If Tika is the right solution, then we should work towards Tika. Not saying this isn't good, just saying it's going to create more maintenance than we want and then we just end up deprecating it in the near future.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment:     (was: SOLR-282.patch)

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-282.patch, SOLR-282.patch, SOLR-282.patch, SOLR-284.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Issue Comment Edited: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726122#action_12726122 ] 

Yonik Seeley edited comment on SOLR-284 at 7/7/09 11:08 AM:
------------------------------------------------------------

>> I just tried setting ext.idx.attr=false, and I didn't see any change after indexing a PDF.
> This is often needed for HTML, where it is used to index the attributes of tags. Same would go for XML.

That's confusing given that the examples on the wiki show PDFs being indexed with ext.idx.attr=true.

It also confused me since the docs say "Index the Tika XHTML attributes into separate fields, named after the attribute." and the docs also say "Tika does everything by producing an XHTML stream that it feeds to a SAX ContentHandler".
That led me to believe that ext.idx.attr was for all Tika-generated metadata (or maybe it is, but Tika doesn't generally use attributes?).

It's also rather confusing just what rules can be applied to what.  For example, does ext.metadata.prefix work on stuff produced by ext.idx.attr?
edit: nope, I just tried, and that does not work.

  
> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650368#action_12650368 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

I like how Erik has given names to contribs, etc.: Flare, Celeritas, etc.  So, I thought I would give one too:

I was typing the javadocs and wrote "Solr Content Extraction Library".   Which then led me to "Solr Cell" as the project name?  http://en.wikipedia.org/wiki/Solar_cell  It's also nice, b/c a solar cell's job is to convert the raw energy of the Sun to electricity, and this contrib module is responsible for converting the "raw" content of a document into something usable by Solr.

I know, I know, get a life...  ;-)  Still, it beats "ExtractingRequestHandler" as a name!

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724871#action_12724871 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

Apologies for not reviewing this sooner after it was committed - but this is the last/best chance to improve the interface before 1.4 is released (and this is very important new functionality).

Since the "ext." prefix seems unnecessary and removing it is already a name change, we might as well revisit the names themselves anyway.  Here are my first thoughts on it:
{code}
//////// generic type stuff that could be reused by other update handlers
boost.myfield=2.3
literal.myfield=Hello
map.origfield=newfield
uprefix=attr_ 
  // map any unknown fields using a standard prefix... good for
  // dynamic field mapping.

//////// more solr cell specific
capture.target_field=div
  // does capture + field-map in single step... avoids name clashes
xpath=xpath_expr
  // future: could do xpath.targetfield=xpath_expr
extract_only=true  // periods aren't word separators, but scoping operators
 // in the future, this could be replaced with a generic update operation
 // to return the document(s) instead of indexing them.
resource.name=test.pdf

New idea:
  nicenames=true // Last-Modified -> last_modified


REMOVED:
ext.ignore.und.fl 
  // throwing an exception when a field-type doesn't exist is generic
  // and not needed.  we should never silently ignore.
ext.idx.attr
  // do we ever want this to be false?  we can ignore all attributes
  // with field mappings if we want to
ext.metadata.prefix
  // seems like we only want to map unknown fields, not all fields
ext.def.fl 
  // we can use a standard field name for indexing main content
  // and use map to move it if desired. "content"? 
{code}

Do people view this as an improvement?
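
As a rough illustration of two of the proposed parameters, the Java sketch below shows how nicenames and map.origfield=newfield might combine. Everything in it is an assumption made for the example: the class FieldNameSketch, both method names, the exact normalization rule (lowercase, then replace anything that is not a letter or digit with an underscore), and the choice to apply nicenames before the explicit mapping.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the proposed naming rules; not the handler's code.
class FieldNameSketch {
    // Assumed nicenames rule: lowercase, then turn every character
    // that is not a-z or 0-9 into an underscore.
    static String niceName(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "_");
    }

    // Apply nicenames if enabled, then an explicit rename if one is configured.
    static String targetField(String name, boolean nicenames, Map<String, String> renames) {
        String normalized = nicenames ? niceName(name) : name;
        return renames.getOrDefault(normalized, normalized);
    }

    public static void main(String[] args) {
        // Stand-in for a request carrying map.last_modified=modified_dt (made-up target name).
        Map<String, String> renames = Map.of("last_modified", "modified_dt");
        System.out.println(targetField("Last-Modified", true, Map.of()));  // last_modified
        System.out.println(targetField("Last-Modified", true, renames));   // modified_dt
    }
}
```

Normalizing first means a mapping only has to be written against one spelling of each metadata name, which is one argument for doing the two steps in this order.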

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726122#action_12726122 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

>> I just tried setting ext.idx.attr=false, and I didn't see any change after indexing a PDF.
> This is often needed for HTML, where it is used to index the attributes of tags. Same would go for XML.

That's confusing given that the examples on the wiki show PDFs being indexed with ext.idx.attr=true.

It also confused me since the docs say "Index the Tika XHTML attributes into separate fields, named after the attribute." and the docs also say "Tika does everything by producing an XHTML stream that it feeds to a SAX ContentHandler".
That led me to believe that ext.idx.attr was for all Tika-generated metadata (or maybe it is, but Tika doesn't generally use attributes?).

It's also rather confusing just what rules can be applied to what.  For example, does ext.metadata.prefix work on stuff produced by ext.idx.attr?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment:     (was: rich.patch)

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730929#action_12730929 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

OK, I've committed the above.  I'll work on updating the wiki, including clarifying things that didn't make sense the first time I looked at this.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627309#action_12627309 ] 

Eric Pugh commented on SOLR-284:
--------------------------------

So, in typical open source fashion, I wrote the original patch to scratch my own itch.  Which meant that it was okay to make id be hardcoded.  However, even when I first posted the patch to this JIRA issue, I felt a little "icky" about the id field.  It seemed like a code smell to have this magic id!   So, from that standpoint, I think the changes that Chris has posted look great.  

I think it's a good example of a patch getting better and better every time someone else uses it!

Now, if only this almost 14-month-old patch could be applied!  With 28 votes, and 16 active watches, clearly somebody out there finds this useful!   

And at this point it is miles better than what I first posted!  Keep up the great work and great contributions back!

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663150#action_12663150 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

bq. I could, however, see adding a flag to specify whether one wants "silent success" or not. I think the use case for content extraction is different than the normal XML message path. Often times, these files are quite large and the cost of sending them to the system is significant.

In my own use case of the handler, I imagine the fail-on-missing-key policy would be the more helpful policy. This is because I want to be in control of my own key, and if Solr fails as soon as I don't provide one, that's going to help me find the bug in my indexing code right away, whereas "silent success" will allow that bug to fester. I'm not sure there would be significant countervailing advantages to the other policy. It's true that transferring a large file when you're just going to get an error message wastes some time, but I feel like in debugging there's potential to waste a lot more time.

My first choice would be for fail-on-missing-key to be the default, followed by having an easy-to-set flag. In any case, though, it would be nice not to have to create a custom SolrContentHandler just to get this one sanity check.
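
The flag could look something like the Java sketch below. It is only an illustration under stated assumptions: MissingKeyPolicySketch, the Policy enum, and ensureKey are hypothetical names, a plain Map stands in for the Solr input document, and a random UUID stands in for whatever ID generation the handler really uses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of a fail-or-generate switch; not the handler's code.
class MissingKeyPolicySketch {
    enum Policy { FAIL, GENERATE }

    // doc stands in for the input document; keyField is the unique key field name.
    static void ensureKey(Map<String, Object> doc, String keyField, Policy policy) {
        if (doc.containsKey(keyField)) {
            return; // the caller supplied a key, nothing to do
        }
        if (policy == Policy.FAIL) {
            // Surface the indexing bug immediately instead of hiding it.
            throw new IllegalArgumentException("Document is missing required key field: " + keyField);
        }
        // "Silent success": accept the document under a generated key.
        doc.put(keyField, UUID.randomUUID().toString());
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("content", "extracted text");
        ensureKey(doc, "id", Policy.GENERATE);
        System.out.println("generated id: " + doc.get("id"));

        try {
            ensureKey(new HashMap<>(), "id", Policy.FAIL);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Making FAIL the default would match the preference above, while GENERATE keeps the option of accepting an expensive upload that arrived without a key.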

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662660#action_12662660 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

bq. Should the schema designer just use the UUID field type or decide to have no unique key field? This seems more modular and follows other aspects of the design.

I guess I usually prefer having a unique key field, as it always gives you that one last handle to grab on to when you need to find a specific document.  However, I'm not sure I follow what you mean by having no unique key field being more modular.

I put in the code b/c I figured it was better to generate an ID than to outright reject the document. Unlike adding XML, sending large files can be really expensive, so I wanted it to handle as many edge cases as possible and still accept a document.

Here's the code:
{code}
// If the schema defines a unique key and the document lacks one, generate it.
SchemaField uniqueField = schema.getUniqueKeyField();
if (uniqueField != null) {
  String uniqueFieldName = uniqueField.getName();
  SolrInputField uniqFld = document.getField(uniqueFieldName);
  if (uniqFld == null) {
    String uniqId = generateId(uniqueField);
    // Only add the key if one could actually be generated.
    if (uniqId != null) {
      document.addField(uniqueFieldName, uniqId);
    }
  }
}
{code}

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754644#action_12754644 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

bq. it is often the case where one wants all values that aren't explicitly mapped to go into a default field

What's the real use-case, to be able to search all metadata?  One could use a dynamic copyField into a single indexed field.  That also helps if one still wants to keep all of the stored values for the metadata in separate fields.
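
For readers unfamiliar with the idea, the effect of such a copyField can be modeled with the small Java sketch below. It only simulates the outcome on an in-memory document and is not how Solr implements copyField; the class CopyFieldSketch, the method copyByPrefix, the attr_ prefix, and the all_meta destination are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of a prefix-based copy rule; not Solr's copyField code.
class CopyFieldSketch {
    // Copies the values of every field whose name starts with sourcePrefix into dest.
    // The original fields are left in place, so they can still be stored separately.
    // Any existing dest values are replaced in this simplified model.
    static Map<String, List<String>> copyByPrefix(
            Map<String, List<String>> doc, String sourcePrefix, String dest) {
        Map<String, List<String>> out = new LinkedHashMap<>(doc);
        List<String> combined = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : doc.entrySet()) {
            if (entry.getKey().startsWith(sourcePrefix)) {
                combined.addAll(entry.getValue());
            }
        }
        if (!combined.isEmpty()) {
            out.put(dest, combined);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("content", List.of("body text"));
        doc.put("attr_author", List.of("Jane"));
        doc.put("attr_producer", List.of("Word"));
        System.out.println(copyByPrefix(doc, "attr_", "all_meta"));
    }
}
```

The point of the model is that the per-field values survive alongside the combined field, which is what makes this approach attractive when the separate stored values are still wanted.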


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "David Smiley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755053#action_12755053 ] 

David Smiley commented on SOLR-284:
-----------------------------------

Grant, your response confuses me.  How does a copyField _necessitate_ storing the fields?  And how is the copyField slower than this feature mapping to a common attribute which ends up with an equivalent outcome?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725237#action_12725237 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

bq. Apologies for not reviewing this sooner after it was committed - but this is the last/best chance to improve the interface before 1.4 is released (and this is very important new functionality).

My only request is that, if you're changing how field mapping works and maybe removing ext.ignore.und.fl, you make sure it stays easy to say, "Tika, I don't care about any of your parsed metadata. Please leave it out of my Solr index." In my current use case I already know all the metadata I want, and including the Tika-parsed fields would result in index bloat. (My temptation would be to make excluding Tika-parsed fields the default, though it sounds like other people have the opposite inclination.)


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756266#action_12756266 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

bq. example solrconfig.xml at the head of SVN trunk still uses map, not fmap.

Thanks, I just fixed this.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583233#action_12583233 ] 

Eric Pugh commented on SOLR-284:
--------------------------------

Oh, and don't forget to vote for it as well:

https://issues.apache.org/jira/browse/SOLR?report=com.atlassian.jira.plugin.system.project:popularissues-panel

It's the current leading vote getter!

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: SOLR-284.patch

As I mentioned before, tests for these 

  solr.client.solrj.embedded.JettyWebappTest
  solr.client.solrj.embedded.LargeVolumeJettyTest
  solr.client.solrj.embedded.SolrExampleJettyTest
  solr.client.solrj.response.TestSpellCheckResponse

were failing, with Solr giving a ClassNotFoundException for one of the extracting document loader (i.e. Solr Cell) classes.

This revision fixes this by removing all references to this Tika handler from /trunk/example/conf/solrconfig.xml and /trunk/example/conf/schema.xml. Note that these references still exist (and are still used for testing) in /trunk/contrib/extraction/src/test/resources/solr/conf.

There are probably other ways to make these tests pass, perhaps involving changing the setUp() methods for the above mentioned tests' java files. (For example, maybe you could fiddle with the path parameter passed to the WebAppContext constructor in JettyWebappTest.java? I don't really know anything about this embedded stuff.) I like the current approach, though, because it avoids further changes to code that's logically independent of this handler.


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647670#action_12647670 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

OK, I've created http://wiki.apache.org/solr/ExtractingRequestHandler and linked it from the old page.  I will have a preliminary patch up today.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662616#action_12662616 ] 

Lance Norskog commented on SOLR-284:
------------------------------------

The ExtractingRequestHandler has its own UUID generation code. 

Should the schema designer just use the UUID field type or decide to have no unique key field?  This seems more modular and follows other aspects of the design.
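
To make the alternative concrete: if the handler stopped generating keys, a key could instead be filled in wherever the schema or the client decides. The sketch below is plain Python, not Solr code. The `ensure_key` helper and the `id` field name are hypothetical and only illustrate assigning a random UUID when no key is present.

```python
import uuid

def ensure_key(doc, key_field="id"):
    """Return a copy of doc with a random UUID in key_field if none is set."""
    out = dict(doc)
    if not out.get(key_field):
        # Hypothetical stand-in for what a UUID field type would do.
        out[key_field] = str(uuid.uuid4())
    return out

generated = ensure_key({"title": "solr-word.pdf"})
preserved = ensure_key({"id": "doc-1", "title": "solr-word.pdf"})
```

A document that already carries a key keeps it, so the same code path would serve both the "UUID key" and the "caller supplies the key" cases.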



> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646987#action_12646987 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

{quote}
3. Tika provides a mechanism for implementing your own SAX ContentHandler and passing that in. I will likely make this pluggable such that people can provide their own. I think this would allow people to make even further refinements to the content (i.e. splitting on paragraphs or other things like that?)
{quote}

Now that I'm digging in more, this actually isn't needed.  The ProcessorChain can be used for this stuff.
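
As a rough illustration of the idea (refinements applied as a chain of steps after extraction, rather than inside a custom ContentHandler), here is a minimal Python sketch. It does not use Solr's actual processor API. The function names and the `content` and `title` fields are assumptions made up for the example.

```python
def split_paragraphs(doc):
    """Split the 'content' field into a list of non-empty paragraphs."""
    out = dict(doc)
    text = out.get("content", "")
    out["content"] = [p.strip() for p in text.split("\n\n") if p.strip()]
    return out

def lowercase_title(doc):
    """Lower-case the 'title' field if it is present."""
    out = dict(doc)
    if "title" in out:
        out["title"] = out["title"].lower()
    return out

def run_chain(doc, processors):
    """Pass the document through each processor in order."""
    for process in processors:
        doc = process(doc)
    return doc

result = run_chain(
    {"title": "Report", "content": "First para.\n\nSecond para.\n"},
    [split_paragraphs, lowercase_title],
)
```

Each step only sees the already-extracted document, which is why the extraction code itself would not need to be pluggable for this kind of post-processing.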

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755008#action_12755008 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

Yonik, any objections to me committing the current patch given your concerns?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Rogério Pereira Araújo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656023#action_12656023 ] 

Rogério Pereira Araújo commented on SOLR-284:
---------------------------------------------

Grant, lemme know how can I help.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: rich.patch

This update is just to make a tiny refactoring, bringing all the handler's parsing classes under 

src\java\org\apache\solr\handler\rich

and all the testing classes under 

src\test\org\apache\solr\handler\rich

All tests pass.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647875#action_12647875 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

Still to do:

1. More unit tests

2. We need to do the crypto notice for Solr once this is committed.   See https://issues.apache.org/jira/browse/NUTCH-621 for examples.  I will link a new issue for this so as not to hold up this patch from being committed.  It just needs to be done before releasing 1.4

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-282.patch, SOLR-282.patch, SOLR-282.patch, SOLR-282.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724859#action_12724859 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

ext.capture seems problematic in that one needs a separate ext.map statement to move what you capture... but it doesn't seem to work well if you already have fieldnames that might match something you are trying to capture.

perhaps something of the form
capture.targetfield=expression
would work better?
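
A sketch of how parameters of that proposed form might be read, in Python rather than Solr code. The `capture.` prefix follows the form suggested above. The helper name, the sample field names, and the sample expressions are hypothetical.

```python
def parse_capture_params(params, prefix="capture."):
    """Collect {target_field: expression} from params named '<prefix><target_field>'."""
    captures = {}
    for name, value in params.items():
        # Skip unrelated params and a bare "capture." with no target field.
        if name.startswith(prefix) and len(name) > len(prefix):
            captures[name[len(prefix):]] = value
    return captures

params = {"capture.div_text": "div", "capture.para_text": "p", "commit": "true"}
captures = parse_capture_params(params)
```

Because the target field is part of the parameter name, no second mapping statement is needed, and the captured name never has to coexist with an existing field name before being moved.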

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663186#action_12663186 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

I guess I'm fine with it.  So, should we remove key generation altogether?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755058#action_12755058 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

bq. How does a copyField necessitate storing the fields?

Yonik suggested his approach helps with stored values for the metadata.

bq. And how is the copyField slower than this feature mapping to a common attribute which ends up with an equivalent outcome? 

As I understand Yonik's response, he was suggesting that I use the uprefix combined with copy fields.  That involves two field entries, when I only care about the one catch all.  copyFields do have a cost, especially when you don't need them.

At any rate, with the patch I put up, you have the option of doing it either way.
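
To spell out the two routings being compared, here is a small Python sketch. It is only a model of where unknown metadata ends up. It does not represent copyField, stored values, or any real Solr behaviour, and the `attr_` prefix and `metadata` catch-all field are hypothetical names.

```python
def route_with_prefix(metadata, declared, prefix):
    """Keep declared fields as-is; rename every other field to prefix + name."""
    return {(k if k in declared else prefix + k): v for k, v in metadata.items()}

def route_to_catch_all(metadata, declared, catch_all):
    """Keep declared fields as-is; append every other value to one catch-all field."""
    out = {}
    for name, value in metadata.items():
        if name in declared:
            out[name] = value
        else:
            out.setdefault(catch_all, []).append(value)
    return out

metadata = {"title": "Report", "Author": "E. Pugh", "producer": "Word"}
declared = {"title"}
prefixed = route_with_prefix(metadata, declared, "attr_")
merged = route_to_catch_all(metadata, declared, "metadata")
```

The first routing yields one field per unknown name (which would then still need copying to be searched together). The second yields a single field directly, which is the case described above where only the catch-all matters.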

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: test-files.zip

New version of test-files.zip. Contains new file, simple.txt, that is used by a new unit test for plaintext files.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651073#action_12651073 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

On r720403, I'm noticing that before I apply this patch tests pass, whereas after I apply this patch the following tests fail:

solr.client.solrj.embedded.JettyWebappTest
solr.client.solrj.embedded.LargeVolumeJettyTest
solr.client.solrj.embedded.SolrExampleJettyTest
solr.client.solrj.response.TestSpellCheckResponse

In each case Solr outputs this exception: "On Solr startup: SEVERE: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.ExtractingRequestHandler'"

I'm not sure the best way to get the ExtractingRequestHandler into the classpath here.

Sort of related, I've noticed that ExtractingRequestHandler doesn't currently get built into the .war file when you run "ant example", in contrast to DataImportHandler, which *does* get put into the .war by means of this target in its build.xml (among other targets):

  <target name="dist" depends="build">
  	<copy todir="../../build/web">
  		<fileset dir="src/main/webapp" includes="**" />
  	</copy>
  	<mkdir dir="../../build/web/WEB-INF/lib"/>
  	<copy file="target/${fullnamever}.jar" todir="${solr-path}/build/web/WEB-INF/lib"></copy>
  	<copy file="target/${fullnamever}.jar" todir="${solr-path}/dist"></copy>
  </target>

Should ExtractingRequestHandler's build.xml perhaps have an analogous "dist" target, along these lines:

  <target name="dist" depends="build">
  	<mkdir dir="../../build/web/WEB-INF/lib"/>
  	<copy file="build/${fullnamever}.jar" todir="${solr-path}/build/web/WEB-INF/lib"></copy>
  	<copy file="build/${fullnamever}.jar" todir="${solr-path}/dist"></copy>
  </target>


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649551#action_12649551 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

A few comments on the ExtractingDocumentLoader:

* I think I like where this is going.

* Currently the default is ext.ignore.und.fl (IGNORE_UNDECLARED_FIELDS) == false, which means that if Tika returns a metadata field and you haven't made an explicit mapping from the Tika fieldname to your Solr fieldname, then Solr will throw an exception and your document add will fail. This doesn't seem very robust for a production environment, unless Tika will only ever use a finite list of metadata field names. (That doesn't sound plausible, though I admit I haven't looked into it.) Even in that case, I think I'd rather not have to set up a mapping for every possible field name in order to get started with this handler. Would true perhaps be a better default?

* ext.capture / CAPTURE_FIELDS: Do you have a use case in mind for this feature, Grant? The example in the patch is of routing text from <div> tags to one Solr field while routing text from other tags to a different Solr field. I'm kind of curious when this would be useful, especially keeping in mind that, in general, Tika source documents are not HTML, and so when <div> tags are generated they're as much artifacts of Tika as reflecting anything in the underlying document. (You could maybe ask a similar question about ext.inx.attr / INDEX_ATTRIBUTES.)
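
The difference between the two settings of ext.ignore.und.fl discussed above can be sketched as follows. This is a Python model of the described behaviour, not the handler's code. The `map_fields` helper, its arguments, and the sample field names are assumptions for illustration.

```python
def map_fields(tika_metadata, field_map, ignore_undeclared=False):
    """Rename Tika fields via field_map; unmapped fields raise or are dropped."""
    doc = {}
    for name, value in tika_metadata.items():
        if name in field_map:
            doc[field_map[name]] = value
        elif not ignore_undeclared:
            # Models the current default: an unmapped field fails the whole add.
            raise ValueError("no mapping for field: " + name)
    return doc

metadata = {"title": "Report", "Last-Save-Date": "2008-11-15"}
lenient = map_fields(metadata, {"title": "title"}, ignore_undeclared=True)
```

With `ignore_undeclared=False` the same call raises on `Last-Save-Date`, which is the failure mode that makes the strict default awkward when the set of metadata names is open-ended.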


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment: rich.patch

Update patches for revision 572774

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: SOLR-284.patch

The 2008-11-15 01:12 PM SOLR-284.patch wasn't applying cleanly to trunk r720403 for me. (One of the hunks for client/java/solrj/src/org/apache/solr/client/solrj/util/ClientUtils.java wouldn't apply.) With this very small update, it does apply cleanly.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: rich.patch

THIS IS A BREAKING CHANGE TO RICH.PATCH! CLIENT URLs NEED TO BE UPDATED!

All unit tests pass.

Changes:

* As suggested earlier, the "id" parameter is no longer treated as a special case; it is not required, and it does not need to be an int. If you *do* use a field called "id", you *must* now declare it in the fieldnames parameter, as you would any other field.

* Do updates with UpdateRequestProcessor and SolrInputDocument, rather than UpdateHandler and DocumentBuilder. (The latter pair appear to be obsolete.)

* Previously, if you declared a field in the fieldnames parameter but then did not specify a value for that field, you would get a NullPointerException. Now you can specify any nonnegative number of values for a declared field, including zero. (I've added a unit test for this.)

* In SolrPDFParser, properly close PDDocument when PDF parsing throws an exception

* Log the stream type in the solr log, rather than on the console

* Some not-very-thorough conversion of tabs to spaces
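
For readers unfamiliar with the resource-closing fix listed above, here is a minimal sketch of the pattern. It is not the code from rich.patch: FakeDocument is a hypothetical stand-in for a parsed document such as PDDocument, and SafeParseSketch is an illustrative name. The point is only that a finally block closes the document whether or not parsing throws.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for a parsed document (e.g. a PDF document object);
// it only records whether close() was called.
final class FakeDocument implements Closeable {
    boolean closed = false;

    String extractText(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("simulated parse failure");
        }
        return "extracted text";
    }

    @Override
    public void close() {
        closed = true;
    }
}

final class SafeParseSketch {
    // Extracts text and always closes the document, even when
    // extractText() throws; the exception still propagates to the caller.
    static String parse(FakeDocument doc, boolean fail) throws IOException {
        try {
            return doc.extractText(fail);
        } finally {
            doc.close();
        }
    }

    public static void main(String[] args) {
        FakeDocument doc = new FakeDocument();
        try {
            parse(doc, true);
        } catch (IOException e) {
            System.out.println("parse failed: " + e.getMessage());
        }
        System.out.println("closed=" + doc.closed);
    }
}
```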

As an aside, I've noticed that I failed in my earlier efforts to incorporate Juri Kuehn's change to allow the id field to be non-integer. Sorry about that, Juri; that was not at all intentional.


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650363#action_12650363 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

The 2008-11-15 01:12 PM version of SOLR-284.patch contains modifications to client/java/solrj/src/org/apache/solr/client/solrj/util/ClientUtils.java related to date handling. That's not intentional, is it?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647748#action_12647748 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

Things to do:

1. Documentation
2. Way more testing, esp. unit tests of the various parameters
3. Update NOTICES and LICENSE.txt for the new dependencies.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-282.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728636#action_12728636 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

bq.  The date.format thing is interesting.... but shouldn't that really be part of a Date fieldType that can accept all those formats?

Agreed, I was just wanting more Date Field Type capabilities the other day.  It would be nice to be able to specify two things on the Date fieldType:
1. Input formats accepted like what the ExtractingRequestHandler offers
2. Output granularity.  That is, may not want to store seconds, etc. so Solr should drop the precision.  Note, this is different from Trie in that it is only indexing one token.

Probably should handle on a separate issue.
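
To make the two ideas concrete, here is a rough sketch, not a proposal for the actual fieldType API. The class name, the pattern list, and the normalize method are all made up for illustration, and the input is naively treated as UTC. It tries several input patterns in order, then truncates to the requested granularity before writing the canonical form.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.List;

final class LenientDateSketch {
    // Hypothetical list of accepted input patterns, tried in order.
    static final List<DateTimeFormatter> INPUT_FORMATS = Arrays.asList(
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss'Z'"),
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss"),
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm"));

    // Canonical output form; the trailing Z is written literally because
    // this sketch does no time zone conversion.
    static final DateTimeFormatter OUTPUT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss'Z'");

    // Parses raw with the first matching pattern and drops any precision
    // finer than the given granularity (e.g. MINUTES or DAYS).
    static String normalize(String raw, ChronoUnit granularity) {
        for (DateTimeFormatter format : INPUT_FORMATS) {
            try {
                LocalDateTime parsed = LocalDateTime.parse(raw, format);
                return parsed.truncatedTo(granularity).format(OUTPUT);
            } catch (DateTimeParseException e) {
                // Not this pattern; fall through to the next one.
            }
        }
        throw new IllegalArgumentException("Unparseable date: " + raw);
    }

    public static void main(String[] args) {
        System.out.println(normalize("2009-07-08 12:34:56", ChronoUnit.DAYS));
    }
}
```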

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651813#action_12651813 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

bq. Sort of related, I've noticed that ExtractingRequestHandler doesn't currently get built into the .war file when you run "ant example", in contrast to DataImportHandler, which does get put into the .war by means of this target in its build.xml (among other targets):

Yes, it does NOT get put into the WAR on purpose.  Unfortunately, I think the DIH does this wrong (but it's probably too late now).  A contrib should be optional, as not everyone wants it/needs it.  Solr Cell works solely by putting it into the Solr Home lib directory and then hooking it into the config.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509683 ] 

Eric Pugh commented on SOLR-284:
--------------------------------

So, I was not attempting to "boil the ocean" and provide the ultimate solution.  Our need was just to take all the raw text and index it in a field, and pass in a bunch of other data fields to be indexed.  

We are parsing a large number of unstructured documents that may or may not have common fields populated, but fortunately we don't really need them.  Our users aren't searching by author, but by content.  

I think there are only 5 additional libraries, and one (poi-scratchpad) may be able to be removed...

Yonik also mentioned using Tika, as a framework for creating a common interface to these types of rich documents, but Tika is still in incubation and has no code in it!

I originally had separate handlers for each data type, and that was really icky, so I condensed it into the RichDocumentRequestHandler.  I could also merge the CSVRequestHandler into it, by just taking out the logic for parsing CSV and putting it into a CSVParser.  However, the CSVRequestHandler has very complex and rich semantics that these unstructured documents don't really need.



> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Resolved: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved SOLR-284.
----------------------------------

    Resolution: Fixed

I removed the auto key generation: Committed revision 741907.  I think this can officially close out this patch.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654234#action_12654234 ] 

Ryan McKinley commented on SOLR-284:
------------------------------------

Looks like there are a bunch of duplicate .jar files in lib.  You could remove these and use the ones that are already in /lib

{panel}
Index: contrib/extraction/lib/commons-io-1.4.jar
Index: contrib/extraction/lib/commons-codec-1.3.jar
Index: contrib/extraction/lib/commons-lang-2.1.jar
Index: contrib/extraction/lib/commons-logging-1.0.4.jar
Index: contrib/extraction/lib/junit-3.8.1.jar
{panel}

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Issue Comment Edited: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754626#action_12754626 ] 

Grant Ingersoll edited comment on SOLR-284 at 9/12/09 3:27 PM:
---------------------------------------------------------------

I don't think we should drop ext.def.fl (the name can change, but the functionality is useful) and am going to reopen this.  Namely, it is often the case where one wants all values that aren't explicitly mapped to go into a default field and I don't think that is possible using uprefix.  Since all metadata fields aren't knowable up front, there is currently no way to express this in the ExtractingRequestHandler.
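
As a rough illustration of the behaviour being asked for (a sketch only, not ExtractingRequestHandler code; the class, method, and field names here are invented): metadata names with an explicit mapping go to their mapped field, and every other name, whatever it turns out to be, lands in one default field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DefaultFieldSketch {
    // Routes each extracted metadata value to its explicitly mapped field,
    // or to defaultField when no mapping exists for that metadata name.
    static Map<String, List<String>> route(Map<String, String> metadata,
                                           Map<String, String> explicitMap,
                                           String defaultField) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String target = explicitMap.getOrDefault(entry.getKey(), defaultField);
            doc.computeIfAbsent(target, k -> new ArrayList<>()).add(entry.getValue());
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Quarterly Report");
        metadata.put("producer", "Some PDF Tool");

        Map<String, String> explicitMap = new LinkedHashMap<>();
        explicitMap.put("title", "title");

        System.out.println(route(metadata, explicitMap, "text"));
    }
}
```

The caller never needs to know the unmapped names in advance, which is the property a prefix-based scheme does not give you on its own.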

      was (Author: gsingers):
    I don't think we should drop ext.def.fl (the name can change, but the functionality is useful)and am going to reopen this.  Namely, it is often the case where one wants all values that aren't explicitly mapped to go into a default field.  Since all metadata fields aren't knowable up front, there is currently no way to express this in the ExtractingRequestHandler.
  
> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment: libs.zip

new jars to go in trunk/lib for pdf and office parsing...

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-284:
---------------------------------

    Attachment:     (was: SOLR-282.patch)

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656032#action_12656032 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

OK, I just committed:

1. Upgraded to Tika 0.2 official release
2. Put in POM support
3. Hooked in various other build things.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Michel Benevento (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594730#action_12594730 ] 

Michel Benevento commented on SOLR-284:
---------------------------------------

Hi, just new here, I am working on Rich Document support for solr-ruby and acts_as_solr. If you are interested, see prelim results at: http://wiki.apache.org/solr/solr-ruby/BrainStorming

For acts_as_solr I need the ID field to be a String, same as Juri Kuehn above who supplied the fix for this.

Is there a specific reason it was not added to the latest rich.patch? I would appreciate it.

Thanks,
Michel

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724863#action_12724863 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

Another comment on parameter naming: period is more like a scoping operator, and less like a word separator.  Hence ext.ignore.und.fl is more readable as ext.ignore_undefined or something.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Erik Hatcher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650431#action_12650431 ] 

Erik Hatcher commented on SOLR-284:
-----------------------------------

bq. I'm not familiar with the state of the patch, but i'm assuming that (by default) all of the metadata fields produced by tika have a common naming convention - either in terms of a common prefix or a common suffix. in which case people can always make a dynamicField declaration to ignore all metadata fields not already explicitly declared.

Tika doesn't need to do this explicitly.... you know all fields coming out of your call to the Tika API will be Tika fields.  Solar Cell (I'm on board with that nickname, Grant - now you're catching on :) - thus we could map all Tika output fields to tika_* where * is the Tika outputted field name.  And with field name mapping this default would be overridden, say tika_title mapped to "title".   Just some off the cuff thoughts.
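
Roughly what I have in mind, as a sketch only: the class and method names are invented, and nothing here calls the Tika API. Every Tika field name gets the default prefix, and a field name mapping, keyed on the prefixed name, can override that default.

```java
import java.util.HashMap;
import java.util.Map;

final class PrefixMappingSketch {
    // Returns the Solr field for a Tika output field: the prefixed name by
    // default, or the override registered for that prefixed name.
    static String targetField(String tikaName, String prefix,
                              Map<String, String> overrides) {
        String prefixed = prefix + tikaName;
        String mapped = overrides.get(prefixed);
        return mapped != null ? mapped : prefixed;
    }

    public static void main(String[] args) {
        Map<String, String> overrides = new HashMap<>();
        overrides.put("tika_title", "title");

        System.out.println(targetField("title", "tika_", overrides));
        System.out.println(targetField("author", "tika_", overrides));
    }
}
```

With a schema that declares a tika_* dynamicField, unmapped fields would then either be indexed or ignored depending on how that dynamicField is defined.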

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Assigned: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned SOLR-284:
------------------------------------

    Assignee: Grant Ingersoll

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595360#action_12595360 ] 

Otis Gospodnetic commented on SOLR-284:
---------------------------------------

+1 for Tika
But also +1 for committing this in the meantime -- wow, lots of watchers and voters!


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647632#action_12647632 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

Grant,

I don't really care if you take over the old wiki page's name or start a new one; maybe it depends on if the updated handler is still going to have a similar name or be called something else. I do think, though, that it might be handy to have *some* wiki page (and maybe some JIRA issue) to maintain the older patch on a temporary basis.

Thanks,
Chris

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Juho-Matti Stenberg (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12569275#action_12569275 ] 

Juho-Matti Stenberg commented on SOLR-284:
------------------------------------------

I wrote a simple patch for RichDocumentRequestHandler to accept multivalued fields. Just POST the same field name multiple times, e.g. category=TVs&category=Radios

{code:title=RichDocumentRequestHandler.java.patch}
Index: RichDocumentRequestHandler.java
===================================================================
--- RichDocumentRequestHandler.java	(revision 0)
+++ RichDocumentRequestHandler.java	(working copy)
@@ -211,7 +211,10 @@
 	  for (int i =0; i < fields.length;i++){
 	    String fieldName = fields[i].getName();
    
-  	    builder.addField(fieldName,params.get(fieldName),1.0f);
+           String[] values = params.getParams(fieldName);
+           for(String value : values) {
+             	    builder.addField(fieldName,value,1.0f);
+           }
 	      
 	  }
{code}

Seems to work for me.

Best Regards,
Pompo
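
For readers who want to see the idea in the patch above outside of Solr, here is a minimal, self-contained sketch. It is not the handler's code: the class and method names are hypothetical, and a plain Map stands in for the request parameters. It also assumes that the lookup may return null when a field was not posted, which the patch does not guard against.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the handler's field loop; these are not Solr classes.
class MultiValuedFieldsSketch {

    /**
     * Collects every value posted for each requested field name.
     * requestParams plays the role of params.getParams(fieldName) in the patch:
     * an array of values, or (by assumption) null when the field was not posted.
     */
    static Map<String, List<String>> collectFieldValues(
            String[] fieldNames, Map<String, String[]> requestParams) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        for (String fieldName : fieldNames) {
            String[] values = requestParams.get(fieldName);
            if (values == null) {
                continue; // field not posted; skip instead of iterating a null array
            }
            List<String> collected = doc.computeIfAbsent(fieldName, k -> new ArrayList<>());
            for (String value : values) {
                collected.add(value); // one entry per posted value, in request order
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // Mimics category=TVs&category=Radios&title=Catalog
        Map<String, String[]> params = new LinkedHashMap<>();
        params.put("category", new String[] {"TVs", "Radios"});
        params.put("title", new String[] {"Catalog"});
        String[] fields = {"category", "title", "author"};
        System.out.println(collectFieldValues(fields, params));
    }
}
```

The null check is the only deliberate difference from the patch; whether it is needed depends on how the real parameter object behaves for absent fields.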

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Eric Pugh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Pugh updated SOLR-284:
---------------------------

    Attachment: source.zip

Java Source code for RichDocumentRequestHandler and friends.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Kristoffer Dyrkorn (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586735#action_12586735 ] 

Kristoffer Dyrkorn commented on SOLR-284:
-----------------------------------------

Very handy!

It could be beneficial to have an option to save the extracted text as XML (so it can be stored) just before adding it to the Solr index. Then, if the Solr schema needs to be changed in a way that triggers a full reindex, the content can be quickly re-fed from a "near source".

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Updated: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-284:
------------------------------

    Attachment: rich.patch

Here's a new version of rich.patch. My previous attempt didn't actually include all the necessary files! (Curses upon you, TortoiseSVN.) This one also includes preliminary support for plaintext and HTML files. (HTML support is done by running the input through the HTMLStripReader.)

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Reopened: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reopened SOLR-284:
----------------------------------


I don't think we should drop ext.def.fl (the name can change, but the functionality is useful) and am going to reopen this.  Namely, it is often the case that one wants all values that aren't explicitly mapped to go into a default field.  Since all metadata fields aren't knowable up front, there is currently no way to express this in the ExtractingRequestHandler.
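
To make the requested behavior concrete, here is a rough sketch of the fallback rule described above. It is only an illustration, not the ExtractingRequestHandler's code: the class, the method, and the field names are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of "explicit mapping first, default field otherwise".
class DefaultFieldMappingSketch {

    /**
     * Returns the schema field that a piece of extracted metadata should go to:
     * the explicitly mapped field if one exists, otherwise the default field.
     */
    static String resolveTargetField(
            String metadataName, Map<String, String> explicitMappings, String defaultField) {
        String mapped = explicitMappings.get(metadataName);
        return mapped != null ? mapped : defaultField;
    }

    public static void main(String[] args) {
        Map<String, String> mappings = new HashMap<>();
        mappings.put("title", "doc_title"); // explicitly mapped metadata name (made up)

        // A known name follows its mapping; an unforeseen name lands in the default field.
        System.out.println(resolveTargetField("title", mappings, "text"));
        System.out.println(resolveTargetField("Last-Author", mappings, "text"));
    }
}
```

The point of the default is that metadata names not known when the mappings were written still end up somewhere searchable.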

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724858#action_12724858 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

Oops, there is a "ext.metadata.prefix" that I missed on the first pass.  This should be defaulted to handle unknown attributes.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653137#action_12653137 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

I think in multicore you can specify a shared library in the solr.xml directory, so you could put the Tika stuff in that dir.


As for the tests, I didn't know the tests had a dependency on the example directory.  That doesn't seem good.  I'm with a client all this week, but will try to get to it this weekend.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654235#action_12654235 ] 

Grant Ingersoll commented on SOLR-284:
--------------------------------------

Thanks, Ryan, I will remove them.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  


[jira] Commented: (SOLR-284) Parsing Rich Document Types

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756154#action_12756154 ] 

Chris Harris commented on SOLR-284:
-----------------------------------

This caught me by surprise, so I'm noting it here in case it helps anyone else:

In SVN r815830 (September 16, 2009), Grant renamed the field name mapping argument "map" to "fmap". The reason was to make naming more consistent with the CSV handler. For more info on this see the following thread:

http://www.nabble.com/Fwd%3A-CSV-Update---Need-help-mapping-csv-field-to-schema%27s-ID-td25463942.html
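
As an illustration of what the rename means for client code, here is a small sketch that builds a request URL using the new parameter name. It assumes the mapping parameters take the form fmap.<source>=<target>; the base URL, the class and method names, and the field names are made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; only the "fmap" parameter name comes from the note above.
class FmapUrlSketch {

    /** Appends one fmap.<source>=<target> parameter per mapping to the base URL. */
    static String buildExtractUrl(String baseUrl, Map<String, String> fieldMappings) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = baseUrl.contains("?") ? '&' : '?';
        for (Map.Entry<String, String> mapping : fieldMappings.entrySet()) {
            url.append(separator)
               .append("fmap.").append(encode(mapping.getKey()))
               .append('=').append(encode(mapping.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("content", "text");   // assumed source and target field names
        mappings.put("author", "creator");
        // The handler path below is an assumption about the local setup.
        System.out.println(buildExtractUrl("http://localhost:8983/solr/update/extract", mappings));
    }
}
```

Clients written against builds older than r815830 would send the "map" form of the parameter instead, so it is worth checking which revision a deployment runs.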



> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  
