Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2015/05/10 00:25:24 UTC

[Tika Wiki] Update of "TikaJAXRS" by HaydenYoung

The "TikaJAXRS" page has been changed by HaydenYoung:
https://wiki.apache.org/tika/TikaJAXRS?action=diff&rev1=37&rev2=38

Comment:
Add documentation about new fileUrl header option for extracting remote files.

  {{{
  java -jar tika-server-x.x.jar --host=intranet.local --port=12345
  }}}
- 
  Once the server is running, you can visit the server's URL in your browser (e.g. {{{http://localhost:9998/}}}); the basic welcome page will confirm that the server is running and give links to the available endpoints.
  
  Below is some basic documentation on how to interact with the services using cURL and HTTP.
  
  == Using prebuilt Docker image ==
- Also, you can download and start it with 
+ Also, you can download and start it with
  
  {{{
  docker pull logicalspark/docker-tikaserver # only on initial download/update
@@ -97, +96 @@

  "Content-Encoding","ISO-8859-2"
  "Content-Type","text/plain"
  }}}
- 
  Get metadata as JSON:
+ 
  {{{
  $ curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "Accept: application/json"
  }}}
- 
  Or XMP:
  
  {{{
  $ curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "Accept: application/rdf+xml"
  }}}
- 
- 
  Get specific metadata key's value as simple text string:
+ 
  {{{
  $ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: text/plain"
  }}}
- 
  Returns:
+ 
  {{{
  application/vnd.openxmlformats-officedocument.wordprocessingml.document
  }}}
- 
- 
  Get specific metadata key's value(s) as CSV:
+ 
  {{{
  $ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: text/csv"
  }}}
- 
  Or JSON:
+ 
  {{{
  $ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: application/json"
  }}}
- 
  Or XMP:
+ 
  {{{
  $ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: application/rdf+xml"
  }}}
- 
  '''Note: when requesting a specific metadata key's value(s) in XMP, make sure to request the XMP name, e.g. "dc:creator" vs. "Author"'''
  
  == Tika Resource ==
@@ -179, +174 @@

  {{{
  $ curl -X PUT -H "Content-Disposition: attachment; filename=foo.csv" --upload-file foo.csv http://localhost:9998/detect/stream
  }}}
- 
  == Language Resource ==
  {{{
  /language/stream
  }}}
- HTTP PUTs or POSTs a document to the LanguageIdentifier to identify its language. 
+ HTTP PUTs or POSTs a document to the LanguageIdentifier to identify its language.
  
  The default return value is the 2-character code of the identified language.
  
@@ -195, +189 @@

  $ curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream
  en
  }}}
- 
  == PUT a TXT file with French comme çi comme ça and get back fr ==
  {{{
  curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream
  fr
  }}}
- 
  {{{
  /language/string
  }}}
@@ -216, +208 @@

  $ curl -X PUT --data "This is English!" http://localhost:9998/language/string
  en
  }}}
- 
  == PUT a string with French comme çi comme ça and get back fr ==
  {{{
  curl -X PUT --data "comme çi comme ça" http://localhost:9998/language/string
  fr
  }}}
- 
  == Translate Resource ==
  {{{
  /translate/all/translator/src/dest
@@ -231, +221 @@

  
  The default return value is the translated string if translation succeeds; otherwise the original string is returned.
  
+ Note that: * *translator* should be a fully qualified Tika class name (with package) e.g., org.apache.tika.language.translate.Lingo24Translator * *src* should be the 2 character short code for the source language, e.g., 'en' for English * *dest* should be the 2 character short code for the dest language, e.g., 'es' for Spanish.
- Note that:
- * *translator* should be a fully qualified Tika class name (with package) e.g., org.apache.tika.language.translate.Lingo24Translator
- * *src* should be the 2 character short code for the source language, e.g., 'en' for English
- * *dest* should be the 2 character short code for the dest language, e.g., 'es' for Spanish.
  
  Some Example calls with cURL:
  
@@ -243, +230 @@

  $ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.Lingo24Translator/es/en
  lack of practice in Spanish
  }}}
- 
  == PUT a TXT file named sentences with Spanish me falta práctica en Español and get back the English translation using Microsoft ==
  {{{
  $ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.MicrosoftTranslator/es/en
  I need practice in Spanish
  }}}
- 
  == PUT a TXT file named sentences with Spanish me falta práctica en Español and get back the English translation using Google ==
  {{{
  $ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.GoogleTranslator/es/en
  I need practice in Spanish
  }}}
- 
  {{{
  /translate/all/src/dest
  }}}
@@ -263, +247 @@

  
  The default return value is the translated string if translation succeeds; otherwise the original string is returned.
  
+ Note that: * *translator* should be a fully qualified Tika class name (with package) e.g., org.apache.tika.language.translate.Lingo24Translator * *dest* should be the 2 character short code for the dest language, e.g., 'es' for Spanish.
- Note that:
- * *translator* should be a fully qualified Tika class name (with package) e.g., org.apache.tika.language.translate.Lingo24Translator
- * *dest* should be the 2 character short code for the dest language, e.g., 'es' for Spanish.
  
  == PUT a TXT file named sentences2 with French comme çi comme ça and get back the English translation using Google auto-detecting the language ==
  {{{
  $ curl -X PUT --data-binary @sentences2 http://localhost:9998/translate/all/org.apache.tika.language.translate.GoogleTranslator/en
  so so
  }}}
- 
  == Recursive Metadata and Content ==
  {{{
  /rmeta
  }}}
+ Returns a JSONified list of Metadata objects for the container document and all embedded documents. The text that is extracted from each document is stored in the metadata object under "X-TIKA:content".
- 
- Returns a JSONified list of Metadata objects for the container document and all embedded documents.
- The text that is extracted from each document is stored in the metadata object under "X-TIKA:content".
  
  {{{
  $ curl -T test_recursive_embedded.docx http://localhost:9998/rmeta
  }}}
- 
  Returns:
+ 
  {{{
  [
   {"Application-Name":"Microsoft Office Word",
@@ -335, +314 @@

  /
  }}}
  Hitting the root of the server in your web browser will give a basic report of all the endpoints defined in the server, their URLs, etc.
+ 
  == Defined Mime Types ==
  {{{
  /mime-types
@@ -359, +339 @@

  List all the available parsers, along with what mimetypes they support
  
  = Extracting A Document From A URL =
- It is possible to use a remote file with TikaJAXRS by downloading it via its URL first then piping it to the appropriate service:
+ Remote files can be PUT to Tika Server using the header "fileUrl":
  
  {{{
- $ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/meta
+ $ curl -i -H "fileUrl:http://url/to/my.file" -H "Accept: application/json" -X PUT http://localhost:9998/meta
- $ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/tika
+ $ curl -i -H "fileUrl:http://url/to/my.file" -H "Accept: text/plain" -X PUT http://localhost:9998/tika
  }}}
- The caveat with above is that it fetches the entire file, so large files such as video can take some time to download. Therefore, you may wish to use curl to get preliminary information (content type, name and size) about the file before you proceed:
+ NOTE: Each PUT will download the entire file from the remote source.
  
- {{{
- $ curl -I http://url/to/my.file
- }}}
- If the file should be parsed (E.g. you only want to get information about mp3s, mp4s and PDFs), send it on to TikaJAXRS.
-
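
For scripting, the two fileUrl calls added in the last hunk can be generated by a small helper. The sketch below is illustrative only: tika_remote_cmd and the TIKA_URL variable are hypothetical names, not part of Tika Server, and the function only prints the curl command rather than running it, so it can be inspected without a running server.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tika): print the curl command that would
# PUT a remote file to a Tika Server endpoint via the "fileUrl" header.
# TIKA_URL is an assumed variable; its default matches the wiki examples.
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

tika_remote_cmd() {
  # $1 = endpoint name (e.g. meta or tika)
  # $2 = URL of the remote file
  # $3 = value for the Accept header
  printf 'curl -i -H "fileUrl:%s" -H "Accept: %s" -X PUT %s/%s\n' \
    "$2" "$3" "$TIKA_URL" "$1"
}

# Print the two calls shown in the diff above.
tika_remote_cmd meta "http://url/to/my.file" "application/json"
tika_remote_cmd tika "http://url/to/my.file" "text/plain"
```

The printed lines have the same form as the commands in the diff. As the NOTE in the diff says, each such PUT makes the server download the entire remote file.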