Posted to dev@tika.apache.org by "Fatih Pazarbasi (Jira)" <ji...@apache.org> on 2021/08/13 06:22:00 UTC

[jira] [Created] (TIKA-3523) A replacement for enableFileUrl or Support for Google Cloud

Fatih Pazarbasi created TIKA-3523:
-------------------------------------

             Summary: A replacement for enableFileUrl or Support for Google Cloud
                 Key: TIKA-3523
                 URL: https://issues.apache.org/jira/browse/TIKA-3523
             Project: Tika
          Issue Type: Wish
          Components: tika-server
    Affects Versions: 2.0.0
            Reporter: Fatih Pazarbasi


Hello,

I have a setup where users upload their files to a cloud bucket, and I forward the fileUrl to run OCR on them in a serverless cloud instance. I do it this way so that users never contact the Tika Server directly and I keep a copy of what they sent for processing. They also have no use for the unprocessed response.

Now that you've removed enableFileUrl, I have to download the files from the cloud bucket they were uploaded to onto the backend instance, and then PUT them to the server's /tika endpoint again...
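
For reference, the download-and-PUT workaround described above looks roughly like the sketch below. It assumes Node 18+ (global fetch) and a tika-server reachable at its default address, http://localhost:9998; the names extractText and TIKA_URL are made up for illustration and are not part of Tika.

```javascript
// Sketch of the workaround: download the file, then PUT its bytes to tika-server.
// TIKA_URL and extractText are hypothetical names used only for this example.
const TIKA_URL = 'http://localhost:9998/tika'; // assumed default tika-server address

async function extractText(fileUrl, tikaUrl = TIKA_URL, fetchImpl = globalThis.fetch) {
  // 1. Download the uploaded file from the bucket.
  const download = await fetchImpl(fileUrl);
  if (!download.ok) {
    throw new Error(`Download failed with status ${download.status}`);
  }
  const body = await download.arrayBuffer();

  // 2. Send the raw bytes to the /tika endpoint and ask for plain text.
  const response = await fetchImpl(tikaUrl, {
    method: 'PUT',
    headers: { Accept: 'text/plain', 'User-Agent': 'Firebase Functions' },
    body,
  });
  if (!response.ok) {
    throw new Error(`Tika request failed with status ${response.status}`);
  }
  return response.text();
}
```

This is the extra round trip I would like to avoid: every file passes through the backend instance once on its way from the bucket to Tika.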

I tried the following config.xml to work around this, but in vain.
For the made-up URL https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/somefilethatdoesnotexist.pdf:
{code:xml}
<fetchers>
 <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
   <name>fsf</name>
   <basePath>https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o</basePath>
  </params>
 </fetcher>
</fetchers>
<emitters>
 <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
  <params>
   <name>fse</name>
   <basePath>gs://abcd-efgh.appspot.com/users</basePath>
  </params>
 </emitter>
</emitters>
<server>
 <params>
  <enableUnsecureFeatures>true</enableUnsecureFeatures>
 </params>
</server>
<pipes>
 <params>
  <tikaConfig>/path/to/tika-config.xml</tikaConfig>
 </params>
</pipes>{code}
{code:javascript}
headers: {
  Accept: 'text/plain',
  'User-Agent': 'Firebase Functions',
  fetcherName: 'fsf',
  fetchKey: 'somefilethatdoesnotexist.pdf',
},{code}
It doesn't support gs:// Google Cloud Storage bucket paths either. I have all the necessary permissions, but that didn't help.
 
In the golden times of 1.2x I was simply using:
 
{code:javascript}
headers: {
  Accept: 'text/plain',
  'User-Agent': 'Firebase Functions',
  fileUrl: 'https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/somefilethatdoesnotexist.pdf',
},{code}
 
 
Am I missing something?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)