Posted to commits@lucene.apache.org by "Hoss Man (Confluence)" <co...@apache.org> on 2013/07/13 01:46:00 UTC

[CONF] Apache Solr Reference Guide > Uploading Data with Solr Cell using Apache Tika

Space: Apache Solr Reference Guide (https://cwiki.apache.org/confluence/display/solr)
Page: Uploading Data with Solr Cell using Apache Tika (https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika)


Edited by Hoss Man:
---------------------------------------------------------------------
{section}
{column:width=75%}
Solr uses code from the [Apache Tika|http://lucene.apache.org/tika/] project to provide a framework for incorporating many different file-format parsers such as [Apache PDFBox|http://incubator.apache.org/pdfbox/] and [Apache POI|http://poi.apache.org/index.html] into Solr itself. Working with this framework, Solr's {{ExtractingRequestHandler}} can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing.

{info}
As of version 4.4, Solr uses Apache Tika v1.4.
{info}

When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework's name: Solr Cell.

If you want to supply your own [ContentHandler|http://wiki.apache.org/solr/ContentHandler] for Solr to use, you can extend the {{ExtractingRequestHandler}} and override the {{createFactory()}} method. This factory is responsible for constructing the [SolrContentHandler|http://wiki.apache.org/solr/SolrContentHandler] that interacts with Tika, and allows literals to override Tika-parsed values. Set the parameter {{literalsOverride}}, which defaults to {{true}}, to {{false}} to append Tika-parsed values to literal values.

For more information on Solr's Extracting Request Handler, see [https://wiki.apache.org/solr/ExtractingRequestHandler].
{column}

{column:width=25%}
{panel}
Topics covered in this section:
{toc:minLevel=2|maxLevel=2}
{panel}
{column}
{section}

h2. Key Concepts

When using the Solr Cell framework, it is helpful to keep the following in mind:

* Tika will automatically attempt to determine the input document type (Word, PDF, HTML) and extract the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the {{stream.type}} parameter.
\\
\\
* Tika works by producing an XHTML stream that it feeds to a SAX ContentHandler. SAX is a common interface implemented for many different XML parsers. For more information, see [http://www.saxproject.org/quickstart.html].
\\
\\
* Solr then responds to Tika's SAX events and creates the fields to index.
\\
\\
* Tika produces metadata such as Title, Subject, and Author according to specifications such as Dublin Core. See [http://tika.apache.org/1.0/formats.html] for the file types supported.
\\
\\
* Tika adds all the extracted text to the {{content}} field. This field is defined as "stored" in {{schema.xml}}. It is also copied to the {{text}} field with a {{copyField}} rule.
\\
\\
* You can map Tika's metadata fields to Solr fields. You can also boost these fields.
\\
\\
* You can pass in literals for field values. Literals will override Tika-parsed values, including fields in the Tika metadata object, the Tika content field, and any "captured content" fields.
\\
\\
* You can apply an XPath expression to the Tika XHTML to restrict the content that is produced.

{tip}
While Apache Tika is quite powerful, it is not perfect and fails on some files. PDF files are particularly problematic, mostly due to the PDF format itself. In case of a failure processing any file, the {{ExtractingRequestHandler}} does not have a secondary mechanism to try to extract some text from the file; it will throw an exception and fail.
{tip}
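
To make the SAX step in the list above more concrete, here is a minimal sketch of a handler that reacts to SAX events and groups text by element name. It is only an illustration built on the JDK's own SAX parser: the class {{ElementTextCollector}} and its {{collect}} method are hypothetical names, and this is not how Solr's {{SolrContentHandler}} is implemented.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: gathers the text found directly inside each element, keyed by element name. */
final class ElementTextCollector extends DefaultHandler {
    private final Deque<String> openElements = new ArrayDeque<>();
    private final Map<String, StringBuilder> fields = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openElements.push(qName); // remember which element we are inside
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openElements.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (openElements.isEmpty()) {
            return;
        }
        // Append the text to the "field" named after the innermost open element.
        fields.computeIfAbsent(openElements.peek(), k -> new StringBuilder())
              .append(ch, start, length);
    }

    /** Parses an XHTML string and returns element name -> concatenated text. */
    static Map<String, String> collect(String xhtml) {
        ElementTextCollector handler = new ElementTextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse input", e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        handler.fields.forEach((name, text) -> result.put(name, text.toString()));
        return result;
    }
}
```

Calling {{ElementTextCollector.collect(...)}} on a small XHTML string returns one entry per element name, which loosely mirrors how captured elements can end up as separate fields.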

h2. Trying out Tika with the Solr Example Directory

You can try out the Tika framework using the example application included in Solr.

Start the Solr example server:

{code:language=none|borderStyle=solid|borderColor=#666666}
cd example
java -jar start.jar
{code}

In a separate window go to the {{docs/}} directory (which contains some nice example docs), or the site directory if you built Solr from source, and send Solr a file via HTTP POST:

{code:language=none|borderStyle=solid|borderColor=#666666}
curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html"
{code}

The URL above calls the Extraction Request Handler, uploads the file {{tutorial.html}} and assigns it the unique ID {{doc1}}. Here's a closer look at the components of this command:

* The {{literal.id=doc1}} parameter provides the necessary unique ID for the document being indexed.

* The {{commit=true}} parameter causes Solr to perform a commit after indexing the document, making it immediately searchable. For optimum performance when loading many documents, don't call the commit command until you are done.

* The {{\-F}} flag instructs curl to POST data using the Content-Type {{multipart/form-data}} and supports the uploading of binary files. The @ symbol instructs curl to upload the attached file.

* The argument {{myfile=@tutorial.html}} needs a valid path, which can be absolute or relative (for example, {{myfile=@../../site/tutorial.html}} if you are still in the {{exampledocs}} directory).

Now you should be able to execute a query and find that document (open the following link in your browser): [http://localhost:8983/solr/select?q=tutorial].

You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called {{text}}, which is indexed but not stored. This operation is controlled by the default map rule in the {{/update/extract}} handler in {{solrconfig.xml}}, and its behavior can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:

{code:language=none|borderStyle=solid|borderColor=#666666}
curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true' -F "myfile=@tutorial.html"
{code}

In this command, the {{uprefix=attr\_}} parameter causes all generated fields that aren't defined in the schema to be prefixed with {{attr\_}}, which matches a dynamic field that is stored.

The {{fmap.content=attr_content}} parameter overrides the default {{fmap.content=text}}, causing the content to be added to the {{attr_content}} field instead.

Then run this command to query the document: [http://localhost:8983/solr/select?q=attr_content:tutorial]
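
When the same request is assembled programmatically, parameter values need to be URL-encoded before they are appended to the handler path. The sketch below shows one way to do that with the JDK only. The class {{ExtractUrlBuilder}} and its {{build}} method are hypothetical helpers for illustration; only the {{/update/extract}} path and the parameter names come from the examples above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper: assembles an /update/extract URL from name/value pairs. */
final class ExtractUrlBuilder {

    /** nameValuePairs is a flat list: name1, value1, name2, value2, ... */
    static String build(String solrBaseUrl, String... nameValuePairs) {
        if (nameValuePairs.length % 2 != 0) {
            throw new IllegalArgumentException("parameters must come in name/value pairs");
        }
        StringBuilder url = new StringBuilder(solrBaseUrl).append("/update/extract");
        for (int i = 0; i < nameValuePairs.length; i += 2) {
            url.append(i == 0 ? '?' : '&')
               .append(encode(nameValuePairs[i]))
               .append('=')
               .append(encode(nameValuePairs[i + 1]));
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            // Form-style encoding: spaces become '+', reserved characters become %XX.
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }
}
```

For example, {{ExtractUrlBuilder.build("http://localhost:8983/solr", "literal.id", "doc1", "uprefix", "attr_", "fmap.content", "attr_content", "commit", "true")}} reproduces the URL used in the curl command above.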

h2. Input Parameters

The table below describes the parameters accepted by the Extraction Request Handler.

|| Parameter || Description ||
| boost.<_fieldname_> | Boosts the specified field by the defined float amount. (Boosting a field alters its importance in a query response. To learn about boosting fields, see [Searching].) |
| capture | Captures XHTML elements with the specified name for a supplementary addition to the Solr document. This parameter can be useful for copying chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs ({{<p>}}) and index them into a separate field. Note that content is still also captured into the overall "content" field. |
| captureAttr | Indexes attributes of the Tika XHTML elements into separate fields, named after the element. If set to true, for example, when extracting from HTML, Tika can return the href attributes in {{<a>}} tags as fields named "a". See the examples below. |
| commitWithin | Adds the document and commits it within the specified number of milliseconds. |
| date.formats | Defines the date format patterns to identify in the documents. |
| defaultField | If the uprefix parameter (see below) is not specified and a field cannot be determined, the default field will be used. |
| extractOnly | Default is false. If true, returns the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. For an example, see [http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput]. |
| extractFormat | Default is "xml", but the other option is "text". Controls the serialization format of the extract content. The xml format is actually XHTML, the same format that results from passing the {{\-x}} command to the Tika command line application, while the text format is like that produced by Tika's {{\-t}} command. This parameter is valid only if {{extractOnly}} is set to true. |
| fmap.<_source_field_> | Maps (moves) one field name to another. The {{source_field}} must be a field in incoming documents, and the value is the Solr field to map to. Example: {{fmap.content=text}} causes the data in the {{content}} field generated by Tika to be moved to Solr's {{text}} field. |
| literal.<_fieldname_> | Populates a field with the name supplied with the specified value for each document. The data can be multivalued if the field is multivalued. |
| literalsOverride | If true (the default), literal field values will override other values with the same field name. If false, literal values defined with {{literal.<_fieldname_>}} will be appended to data already in the fields extracted from Tika. If setting {{literalsOverride}} to "false", the field must be multivalued. |
| lowernames | Values are "true" or "false". If true, all field names will be mapped to lowercase with underscores, if needed. For example, "Content-Type" would be mapped to "content_type". |
| multipartUploadLimitInKB | Useful if uploading very large documents, this defines the KB size of documents to allow. |
| passwordsFile | Defines a file path and name for a file of file-name-to-password mappings. |
| resource.name | Specifies the optional name of the file. Tika can use it as a hint for detecting a file's MIME type. |
| resource.password | Defines a password to use for a password-protected PDF or OOXML file. |
| tika.config | Defines a file path and name to a customized Tika configuration file. This is only required if you have customized your Tika implementation. |
| uprefix | Prefixes all fields that are not defined in the schema with the given prefix. This is very useful when combined with dynamic field definitions. Example: {{uprefix=ignored\_}} would effectively ignore all unknown fields generated by Tika given the example schema contains {{<dynamicField name="ignored_*" type="ignored"/>}}. |
| xpath | When extracting, only return Tika XHTML content that satisfies the given XPath expression. See [http://tika.apache.org/1.0/index.html] for details on the format of Tika XHTML. See also [http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput]. |
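
The interaction between {{literal.<_fieldname_>}} and {{literalsOverride}} described in the table can be summarized in a few lines of code. This is a minimal sketch of the documented behavior only, assuming field values are plain strings. The class {{LiteralMerge}} is a hypothetical name and does not correspond to Solr's internal code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of how literal values combine with Tika-parsed values for one field. */
final class LiteralMerge {

    static List<String> merge(List<String> tikaValues,
                              List<String> literalValues,
                              boolean literalsOverride) {
        if (literalValues.isEmpty()) {
            // No literal supplied for this field: keep what Tika extracted.
            return new ArrayList<>(tikaValues);
        }
        if (literalsOverride) {
            // Default behavior: the literal replaces the Tika-parsed values.
            return new ArrayList<>(literalValues);
        }
        // literalsOverride=false: literals are appended, so the field must be multivalued.
        List<String> merged = new ArrayList<>(tikaValues);
        merged.addAll(literalValues);
        return merged;
    }
}
```

With {{literalsOverride=false}} the result can hold more than one value, which is why the target field has to be multivalued in that case.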

h2. Order of Operations

Here is the order in which the Solr Cell framework, using the Extraction Request Handler and Tika, processes its input.

# Tika generates fields, or fields are passed in as literals specified by {{literal.<fieldname>=<value>}}. If {{literalsOverride=false}}, literals will be appended as multi-value to the Tika-generated field.
# If {{lowernames=true}}, Tika maps fields to lowercase.
# Tika applies the mapping rules specified by {{fmap.<source>=<target>}} parameters.
# If {{uprefix}} is specified, any unknown field names are prefixed with that value, else if {{defaultField}} is specified, any unknown fields are copied to the default field.
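
Steps 2 through 4 above can be read as a small pipeline applied to each field name. The sketch below is one way to model that pipeline. It is a simplification under stated assumptions: the schema is represented as a plain set of field names (dynamic fields are ignored), the lowercase rule replaces every character that is not a letter or digit with an underscore, and {{FieldNameResolver}} is a hypothetical class, not part of Solr.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the field-name steps: lowernames, then fmap, then uprefix/defaultField. */
final class FieldNameResolver {

    static String resolve(String tikaName,
                          boolean lowernames,
                          Map<String, String> fmap,
                          String uprefix,
                          String defaultField,
                          Set<String> schemaFields) {
        String name = tikaName;

        // Step 2: optionally map the name to lowercase with underscores.
        if (lowernames) {
            name = toLowerName(name);
        }

        // Step 3: apply fmap.<source>=<target> if a rule exists for this name.
        String mapped = fmap.get(name);
        if (mapped != null) {
            name = mapped;
        }

        // Step 4: unknown fields get the prefix, or else fall back to the default field.
        if (!schemaFields.contains(name)) {
            if (uprefix != null) {
                name = uprefix + name;
            } else if (defaultField != null) {
                name = defaultField;
            }
        }
        return name;
    }

    /** Assumed rule: lowercase letters and digits, everything else becomes '_'. */
    static String toLowerName(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (char c : name.toCharArray()) {
            sb.append(Character.isLetterOrDigit(c) ? Character.toLowerCase(c) : '_');
        }
        return sb.toString();
    }
}
```

Because the mapping step runs after the lowercase step, an {{fmap}} rule in this model has to use the lowercased source name when {{lowernames=true}}.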

h2. Configuring the Solr {{ExtractingRequestHandler}}

If you are not working in the supplied {{example/solr}} directory, you must copy all libraries from {{example/solr/lib}} into a {{lib}} directory within your own solr directory or to a directory you've specified in {{solrconfig.xml}} using the {{lib}} directive. The {{ExtractingRequestHandler}} is not incorporated into the Solr WAR file, so you have to install it separately.

Here is an example of configuring the {{ExtractingRequestHandler}} in {{solrconfig.xml}}.

{code:xml|borderStyle=solid|borderColor=#666666}
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
  <!-- Optional. Specify a path to a Tika configuration file. See the Tika docs for details. -->
  <str name="tika.config">/my/path/to/tika.config</str>
  <!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS
       for default date formats. -->
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
</requestHandler>
{code}

In the defaults section, we are mapping Tika's Last-Modified metadata attribute to a field named {{last_modified}}. We are also telling it to ignore undeclared fields. Both of these defaults can be overridden by request parameters.

The {{tika.config}} entry points to a file containing a Tika configuration. The {{date.formats}} entry allows you to specify various {{java.text.SimpleDateFormat}} patterns for converting extracted input to a Date. Solr comes configured with the following date formats (see the {{DateUtil}} in Solr):

{{yyyy-MM-dd'T'HH:mm:ss'Z'}}
{{yyyy-MM-dd'T'HH:mm:ss}}
{{yyyy-MM-dd}}
{{yyyy-MM-dd hh:mm:ss}}
{{yyyy-MM-dd HH:mm:ss}}
{{EEE MMM d hh:mm:ss z yyyy}}
{{EEE, dd MMM yyyy HH:mm:ss zzz}}
{{EEEE, dd-MMM-yy HH:mm:ss zzz}}
{{EEE MMM d HH:mm:ss yyyy}}
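
To see how a list of patterns like this can be applied, the sketch below tries each pattern in order and returns the first one that consumes the whole input. It is an illustration only, using {{java.text.SimpleDateFormat}} directly. The class {{DateFormatFallback}} is hypothetical, and the choices of UTC, an English locale, and strict (non-lenient) parsing are assumptions, not a description of Solr's {{DateUtil}}.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper: tries the documented date patterns one after another. */
final class DateFormatFallback {

    static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss'Z'",
        "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd",
        "yyyy-MM-dd hh:mm:ss",
        "yyyy-MM-dd HH:mm:ss",
        "EEE MMM d hh:mm:ss z yyyy",
        "EEE, dd MMM yyyy HH:mm:ss zzz",
        "EEEE, dd-MMM-yy HH:mm:ss zzz",
        "EEE MMM d HH:mm:ss yyyy"
    };

    /** Returns the parsed date, or null if no pattern matches the entire input. */
    static Date parse(String input) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: treat zone-less input as UTC
            format.setLenient(false);
            ParsePosition position = new ParsePosition(0);
            Date parsed = format.parse(input, position);
            // Require the pattern to consume all of the input, not just a prefix.
            if (parsed != null && position.getIndex() == input.length()) {
                return parsed;
            }
        }
        return null;
    }
}
```

Checking the parse position matters here: without it, the date-only pattern would accept the leading part of a date-and-time string and silently drop the time.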

You may also need to adjust the {{multipartUploadLimitInKB}} attribute as follows if you are submitting very large documents.

{code:xml|borderStyle=solid|borderColor=#666666}
  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="20480" />
    ...
{code}

h3. Multi-Core Configuration

For a multi-core configuration, specify {{sharedLib='lib'}} in the {{<solr/>}} section of {{solr.xml}} in order for Solr to find the JAR files in {{example/solr/lib}}.

For more information about Solr cores, see [The Well-Configured Solr Instance|The Well-Configured Solr Instance].

h2. Indexing Encrypted Documents with the ExtractingRequestHandler

The ExtractingRequestHandler will decrypt encrypted files and index their content if you supply a password in either {{resource.password}} on the request, or in a {{passwordsFile}} file. 

In the case of {{passwordsFile}}, the file supplied must be formatted so there is one line per rule. Each rule contains a file name regular expression, followed by "=", then the password in clear-text. Because the passwords are in clear-text, the file should have strict access restrictions. 

{code:language=none|borderStyle=solid|borderColor=#666666}
# This is a comment
myFileName = myPassword
.*\.docx$ = myWordPassword
.*\.pdf$ = myPdfPassword
{code}
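
The rule format above is simple enough to model in a few lines. The sketch below parses rules in that format and looks up the password for a file name. It is a hedged illustration: the class {{PasswordRules}} is hypothetical, and the choices to split each line at the first "=", to require the regular expression to match the whole file name, and to let the first matching rule win are assumptions rather than documented Solr behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical reader for "file name regex = password" rules, one rule per line. */
final class PasswordRules {

    static Map<Pattern, String> parse(String fileContents) {
        Map<Pattern, String> rules = new LinkedHashMap<>(); // keeps rules in file order
        for (String rawLine : fileContents.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // not a rule
            }
            String regex = line.substring(0, eq).trim();
            String password = line.substring(eq + 1).trim();
            rules.put(Pattern.compile(regex), password);
        }
        return rules;
    }

    /** Returns the password of the first rule whose regex matches the file name, or null. */
    static String passwordFor(Map<Pattern, String> rules, String fileName) {
        for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
            if (rule.getKey().matcher(fileName).matches()) {
                return rule.getValue();
            }
        }
        return null;
    }

    static String lookup(String fileContents, String fileName) {
        return passwordFor(parse(fileContents), fileName);
    }
}
```

With the example rules above, a lookup for {{report.pdf}} would select the {{.*\.pdf$}} rule under these assumptions.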

h2. Examples

h3. Metadata

As mentioned before, Tika produces metadata about the document. Metadata describes different aspects of a document, such as the author's name, the number of pages, the file size, and so on. The metadata produced depends on the type of document submitted. For instance, PDFs have different metadata than Word documents do.

In addition to Tika's metadata, Solr adds the following metadata (defined in {{ExtractingMetadataConstants}}):

|| Solr Metadata || Description ||
| stream_name | The name of the Content Stream as uploaded to Solr. Depending on how the file is uploaded, this may or may not be set. |
| stream_source_info | Any source info about the stream. (See the section on Content Streams later in this section.) |
| stream_size | The size of the stream in bytes. |
| stream_content_type | The content type of the stream, if available. |

{note}
We recommend that you try using the {{extractOnly}} option to discover which values Solr is setting for these metadata elements.
{note}

h3. Examples of Uploads Using the Extraction Request Handler

h4. Capture and Mapping

The command below captures {{<div>}} tags separately, and then maps all the instances of that field to a dynamic field named {{foo_t}}.

{code:language=none|borderStyle=solid|borderColor=#666666}
curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F "tutorial=@tutorial.pdf"
{code}

h4. Capture, Mapping, and Boosting

The command below captures {{<div>}} tags separately, maps the field to a dynamic field named {{foo_t}}, then boosts {{foo_t}} by 3.

{code:language=none|borderStyle=solid|borderColor=#666666}
curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3" -F "tutorial=@tutorial.pdf"
{code}

h4. Using Literals to Define Your Own Metadata

To add in your own metadata, pass in the literal parameter along with the file:

{code:language=none|borderStyle=solid|borderColor=#666666}
curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah" -F "tutorial=@tutorial.pdf"
{code}

h4. XPath

The example below passes in an XPath expression to restrict the XHTML returned by Tika:

{code:language=none|borderStyle=solid|borderColor=#666666}
curl "http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()" -F "tutorial=@tutorial.pdf"
{code}

h3. Extracting Data without Indexing It

Solr allows you to extract data without indexing. You might want to do this if you're using Solr solely as an extraction server or if you're interested in testing Solr extraction.

The example below sets the {{extractOnly=true}} parameter to extract data without indexing it.

{code:language=none|borderStyle=solid|borderColor=#666666}
curl "http://localhost:8983/solr/update/extract?extractOnly=true" --data-binary @tutorial.html -H 'Content-type:text/html'
{code}

The output includes XML generated by Tika (and further escaped by Solr's XML response format). Using a different output format, such as Ruby, makes it more readable:

{code:language=none|borderStyle=solid|borderColor=#666666}
curl "http://localhost:8983/solr/update/extract?extractOnly=true&wt=ruby&indent=true" --data-binary @tutorial.html -H 'Content-type:text/html'
{code}

h2. Sending Documents to Solr with a POST

The example below streams the file as the body of the POST, so Solr receives no information about the name of the file.

{code:language=none|borderStyle=solid|borderColor=#666666}
curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" --data-binary @tutorial.html -H 'Content-type:text/html'
{code}

h2. Sending Documents to Solr with Solr Cell and SolrJ

SolrJ is a Java client that you can use to add documents to the index, update the index, or query the index. You'll find more information on SolrJ in [Client APIs].

Here's an example of using Solr Cell and SolrJ to add documents to a Solr index.

First, let's use SolrJ to create a new SolrServer, then we'll construct a request containing a ContentStream (essentially a wrapper around a file) and send it to Solr:

{code:language=java|borderStyle=solid|borderColor=#666666}
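import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.extraction.ExtractingParams;
public class SolrCellRequestDemo {
  public static void main(String[] args) throws IOException, SolrServerException {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("apache-solr/site/features.pdf")); // file to stream to Solr
    req.setParam(ExtractingParams.EXTRACT_ONLY, "true");   // extract only, do not index
    NamedList<Object> result = server.request(req);
    System.out.println("Result: " + result);
  }
}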
{code}

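This operation streams the file {{features.pdf}} to Solr. Because the request sets {{extractOnly}} to true, Solr returns the extracted content without indexing the document; omit that parameter to add the file to the index.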

The sample code above calls the extract command, but you can easily substitute other commands that are supported by Solr Cell. The key class to use is the {{ContentStreamUpdateRequest}}, which makes sure the ContentStreams are set properly. SolrJ takes care of the rest.

Note that the {{ContentStreamUpdateRequest}} is not specific to Solr Cell. You can also use it to send CSV to the CSV Update handler, or to send content to any other Request Handler that works with Content Streams for updates.

h2. Related Topics

* [ExtractingRequestHandler|http://wiki.apache.org/solr/ExtractingRequestHandler]


{scrollbar}

