Posted to dev@lucene.apache.org by "Emmanuel Espina (Created) (JIRA)" <ji...@apache.org> on 2012/03/14 18:04:41 UTC

[jira] [Created] (SOLR-3246) UpdateRequestProcessor to extract Solr XML from rich documents

UpdateRequestProcessor to extract Solr XML from rich documents
--------------------------------------------------------------

                 Key: SOLR-3246
                 URL: https://issues.apache.org/jira/browse/SOLR-3246
             Project: Solr
          Issue Type: New Feature
          Components: update
            Reporter: Emmanuel Espina
            Priority: Minor


This would be an update request processor that saves a file containing the XML representation of the document in an external directory. The original idea was to add it to the processing chain of the ExtractingRequestHandler to store an already-parsed version of the docs. Storing pre-parsed documents makes reindexing the entire index faster (it avoids the Tika phase and just sends the XML to the standard update processor).
As a side effect, extracting the XML can make debugging of rich docs easier.
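
To make the idea concrete, here is a minimal sketch of the dumping step, not the code from the patch. It assumes a document is available as a map of field names to values and that one file per document is written. The class name SolrXmlDumpSketch, the method names, and the use of the document id as the file name are all hypothetical. Only the <add><doc><field name="..."> layout is the standard Solr XML update format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Solr: serializes one document to Solr XML.
final class SolrXmlDumpSketch {

    // Escape the characters that cannot appear verbatim in XML text or attribute values.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Build an <add><doc>...</doc></add> message; multi-valued fields repeat the <field> element.
    static String toSolrXml(Map<String, List<String>> fields) {
        StringBuilder sb = new StringBuilder("<add>\n  <doc>\n");
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            for (String value : e.getValue()) {
                sb.append("    <field name=\"").append(escape(e.getKey())).append("\">")
                  .append(escape(value)).append("</field>\n");
            }
        }
        return sb.append("  </doc>\n</add>\n").toString();
    }

    // Assumption: docId is safe to use as a file name; real code would sanitize it.
    static Path dump(Path outputDir, String docId, Map<String, List<String>> fields)
            throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(docId + ".xml");
        Files.write(target, toSolrXml(fields).getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("id", Arrays.asList("doc1"));
        fields.put("title", Arrays.asList("Tika & Solr"));
        System.out.println(dump(Paths.get("dumps"), "doc1", fields));
    }
}
```

A file written this way could later be posted to the standard XML update handler, which is what would let a reindex skip the Tika phase.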

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-3246) UpdateRequestProcessor to extract Solr XML from rich documents

Posted by "Jan Høydahl (Commented JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229747#comment-13229747 ] 

Jan Høydahl commented on SOLR-3246:
-----------------------------------

We wrote a data dumper in a project as a patched ExtractingUpdateRequestHandler. It writes a CSV format (including Base64-encoded binary input) to a single file. We were thinking about rewriting it as an UpdateProcessor, which would then work much like yours. The benefit of the CSV format is that it is much more compact. Also, a file system may buckle under too many files in one folder.
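
As a rough illustration of the CSV approach (a sketch only; the actual column layout of that dumper is not given here), one row per document could carry the raw input as a Base64 string, so binary content never breaks the line-oriented format. The class CsvDumpSketch and the three columns (id, content type, payload) are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: formats one document as a single CSV row.
final class CsvDumpSketch {

    // Quote a text field: wrap it in double quotes and double any embedded quotes.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // The Base64 alphabet contains no comma, quote, or newline, so the payload needs no quoting.
    static String toCsvRow(String id, String contentType, byte[] rawInput) {
        String encoded = Base64.getEncoder().encodeToString(rawInput);
        return quote(id) + "," + quote(contentType) + "," + encoded;
    }

    public static void main(String[] args) {
        byte[] raw = "hello".getBytes(StandardCharsets.UTF_8);
        // prints "doc1","text/plain",aGVsbG8=
        System.out.println(toCsvRow("doc1", "text/plain", raw));
    }
}
```

Appending such rows to one file keeps the dump to a single file instead of one file per document.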
                

[jira] [Issue Comment Edited] (SOLR-3246) UpdateRequestProcessor to extract Solr XML from rich documents

Posted by "Emmanuel Espina (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238798#comment-13238798 ] 

Emmanuel Espina edited comment on SOLR-3246 at 3/26/12 8:42 PM:
----------------------------------------------------------------

Added some changes to let the user select the format of the output. The patch contains only an XML writer, but others, such as CSV, can be added.
                
      was (Author: emmanuel.espina):
    Added some changes to let the user select the format of the output. In the patch there is only a XML writter, but others like the CSV can be added.
                  

[jira] [Updated] (SOLR-3246) UpdateRequestProcessor to extract Solr XML from rich documents

Posted by "Emmanuel Espina (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Emmanuel Espina updated SOLR-3246:
----------------------------------

    Attachment: SOLR-3246.patch

Initial code for this component (with a very simple test)
                

[jira] [Commented] (SOLR-3246) UpdateRequestProcessor to extract Solr XML from rich documents

Posted by "Emmanuel Espina (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229468#comment-13229468 ] 

Emmanuel Espina commented on SOLR-3246:
---------------------------------------

This is similar to SOLR-903 (https://issues.apache.org/jira/browse/SOLR-903), but this would be a server-side component.
                

[jira] [Commented] (SOLR-3246) UpdateRequestProcessor to extract Solr XML from rich documents

Posted by "Emmanuel Espina (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231457#comment-13231457 ] 

Emmanuel Espina commented on SOLR-3246:
---------------------------------------

The output format could probably be configured in much the same way as the response writers. The XMLWritingUpdateProcessor would then become simply WritingUpdateProcessor, and the writer would be selected with a configuration parameter, with a default (either XML or CSV). That would be:

<updateRequestProcessorChain name="writer">
    <processor class="org.apache.solr.update.processor.WritingUpdateProcessorFactory">
      <str name="outputDir">./dacDumps</str>
      <str name="writer">xml</str>
      <str name="groupFiles">100</str>
    </processor>
</updateRequestProcessorChain>

Another parameter could select whether one, n, or an unlimited number of documents are written to the same file.
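
A minimal sketch of how that selection and grouping could work, under stated assumptions: the DocWriter interface, the registry, and the rule that a non-positive group size means "unlimited" are all hypothetical and not taken from the patch. It is also only an assumption that groupFiles in the configuration above is the documents-per-file parameter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WriterSelectionSketch {

    // Hypothetical output-format abstraction; not an existing Solr interface.
    interface DocWriter {
        String fileExtension();
    }

    static final String DEFAULT_WRITER = "xml";
    static final Map<String, DocWriter> WRITERS = new LinkedHashMap<>();
    static {
        WRITERS.put("xml", () -> ".xml");
        WRITERS.put("csv", () -> ".csv");
    }

    // Pick the writer named in the configuration, falling back to the default when unset.
    static DocWriter select(String configured) {
        String name = (configured == null || configured.isEmpty()) ? DEFAULT_WRITER : configured;
        DocWriter writer = WRITERS.get(name);
        if (writer == null) {
            throw new IllegalArgumentException("Unknown writer: " + name);
        }
        return writer;
    }

    // Split documents into per-file groups; docsPerFile <= 0 is treated as "unlimited".
    static <T> List<List<T>> group(List<T> docs, int docsPerFile) {
        List<List<T>> files = new ArrayList<>();
        if (docs.isEmpty()) {
            return files;
        }
        int size = docsPerFile <= 0 ? docs.size() : docsPerFile;
        for (int start = 0; start < docs.size(); start += size) {
            int end = Math.min(start + size, docs.size());
            files.add(new ArrayList<>(docs.subList(start, end)));
        }
        return files;
    }

    public static void main(String[] args) {
        System.out.println(select(null).fileExtension());
        System.out.println(group(Arrays.asList("a", "b", "c", "d", "e"), 2));
    }
}
```

Grouping several documents per file would also ease the concern about too many files in one folder.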
                

[jira] [Updated] (SOLR-3246) UpdateRequestProcessor to extract Solr XML from rich documents

Posted by "Emmanuel Espina (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Emmanuel Espina updated SOLR-3246:
----------------------------------

    Attachment: SOLR-3246.patch

Added some changes to let the user select the format of the output. The patch contains only an XML writer, but others, such as CSV, can be added.
                