Posted to dev@tika.apache.org by "Fatih Pazarbasi (Jira)" <ji...@apache.org> on 2022/02/09 19:56:00 UTC

[jira] [Commented] (TIKA-3523) A replacement for enableFileUrl or Support for Google Cloud

    [ https://issues.apache.org/jira/browse/TIKA-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489780#comment-17489780 ] 

Fatih Pazarbasi commented on TIKA-3523:
---------------------------------------

Hello again.

I have to say that the tika-config.xml solution keeps giving me errors.
{panel:title=tika-config.xml}
 
<?xml version="1.0" encoding="UTF-8" ?>
 
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
    </parsers>
    <fetchers>
        <fetcher class="org.apache.tika.pipes.fetcher.http.HttpFetcher">
            <params>
                <name>http</name>
            </params>
        </fetcher>
    </fetchers>
    <server>
        <params>
            <enableUnsecureFeatures>true</enableUnsecureFeatures>
        </params>
    </server>
</properties>
 
{panel}
 
 

With the setup from [https://cwiki.apache.org/confluence/display/TIKA/tika-pipes+and+Docker], I get this error:

{code:java}
java.nio.file.NoSuchFileException: C:/Program
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219)
at java.base/java.nio.file.Files.newByteChannel(Files.java:380)
at java.base/java.nio.file.Files.newByteChannel(Files.java:432)
at java.base/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422)
at java.base/java.nio.file.Files.newInputStream(Files.java:160)
at org.apache.tika.server.core.TikaServerConfig.load(TikaServerConfig.java:176)
at org.apache.tika.server.core.TikaServerConfig.load(TikaServerConfig.java:134)
at org.apache.tika.server.core.TikaServerCli.execute(TikaServerCli.java:83)
at org.apache.tika.server.core.TikaServerCli.main(TikaServerCli.java:66)
ERROR [main] 19:53:04,124 org.apache.tika.server.core.TikaServerCli Can't start: 
java.nio.file.NoSuchFileException: C:/Program
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219) ~[?:?]
at java.nio.file.Files.newByteChannel(Files.java:380) ~[?:?]
at java.nio.file.Files.newByteChannel(Files.java:432) ~[?:?]
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422) ~[?:?]
at java.nio.file.Files.newInputStream(Files.java:160) ~[?:?]
at org.apache.tika.server.core.TikaServerConfig.load(TikaServerConfig.java:176) ~[tika-server-standard-2.0.0.jar:2.0.0]
at org.apache.tika.server.core.TikaServerConfig.load(TikaServerConfig.java:134) ~[tika-server-standard-2.0.0.jar:2.0.0]
at org.apache.tika.server.core.TikaServerCli.execute(TikaServerCli.java:83) ~[tika-server-standard-2.0.0.jar:2.0.0]
at org.apache.tika.server.core.TikaServerCli.main(TikaServerCli.java:66) [tika-server-standard-2.0.0.jar:2.0.0]
{code}
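The truncated path in `NoSuchFileException: C:/Program` looks like a path containing a space (for instance one under `C:/Program Files`) that was split into separate arguments before it reached Tika. This is only an assumption, since the log does not show the command line that was used. A minimal sketch of that failure mode, with a hypothetical path and the `-c` config flag used purely for illustration:

```javascript
// Naive whitespace splitting, similar to what an unquoted shell expansion does.
function splitUnquoted(commandLine) {
  return commandLine.trim().split(/\s+/);
}

// Hypothetical config path that contains a space.
const configPath = 'C:/Program Files/example/tika-config.xml';

// Building one string and splitting it cuts the path at the space,
// so the value that follows -c ends up being just 'C:/Program'.
const broken = splitUnquoted(`-c ${configPath}`);

// Passing arguments as an array keeps the path in one piece.
const safe = ['-c', configPath];

console.log(broken[1]);
console.log(safe[1]);
```

If this is what happened, quoting the path or using one without spaces should at least change the error.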


And with apache/tika 2.2.1, I get this error:

{code:java}
INFO  [main] 19:24:09,290 org.apache.tika.server.core.TikaServerProcess Starting Apache Tika 2.2.1 server
INFO  [main] 19:24:09,384 org.apache.tika.server.core.TikaServerProcess Using custom config: /tika-config.xml
ERROR [main] 19:24:09,495 org.apache.tika.server.core.TikaServerProcess Can't start: 
org.apache.tika.exception.TikaConfigException: problem loading fetcher
at org.apache.tika.config.ConfigBase.buildClass(ConfigBase.java:203) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.config.ConfigBase.loadComposite(ConfigBase.java:178) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.config.ConfigBase.buildComposite(ConfigBase.java:151) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.config.ConfigBase.buildComposite(ConfigBase.java:132) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.pipes.fetcher.FetcherManager.load(FetcherManager.java:42) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.server.core.TikaServerProcess.initServer(TikaServerProcess.java:214) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.server.core.TikaServerProcess.main(TikaServerProcess.java:125) [tika-server-standard-2.2.1.jar:2.2.1]
Caused by: java.lang.ClassNotFoundException: org.apache.tika.pipes.fetcher.http.HttpFetcher
at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) ~[?:?]
at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) ~[?:?]
at java.lang.ClassLoader.loadClass(ClassLoader.java:520) ~[?:?]
at java.lang.Class.forName0(Native Method) ~[?:?]
at java.lang.Class.forName(Class.java:375) ~[?:?]
at org.apache.tika.config.ConfigBase.buildClass(ConfigBase.java:195) ~[tika-server-standard-2.2.1.jar:2.2.1]
... 6 more
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Failed to start forked process -- forked is not alive
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.apache.tika.server.core.TikaServerCli.mainLoop(TikaServerCli.java:116)
at org.apache.tika.server.core.TikaServerCli.execute(TikaServerCli.java:88)
at org.apache.tika.server.core.TikaServerCli.main(TikaServerCli.java:66)
Caused by: java.lang.RuntimeException: Failed to start forked process -- forked is not alive
at org.apache.tika.server.core.TikaServerWatchDog$ForkedProcess.<init>(TikaServerWatchDog.java:306)
at org.apache.tika.server.core.TikaServerWatchDog$ForkedProcess.<init>(TikaServerWatchDog.java:269)
at org.apache.tika.server.core.TikaServerWatchDog.startForkedProcess(TikaServerWatchDog.java:209)
at org.apache.tika.server.core.TikaServerWatchDog.call(TikaServerWatchDog.java:143)
at org.apache.tika.server.core.TikaServerWatchDog.call(TikaServerWatchDog.java:53)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
ERROR [main] 19:24:09,582 org.apache.tika.server.core.TikaServerCli Can't start: 
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Failed to start forked process -- forked is not alive
at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
at org.apache.tika.server.core.TikaServerCli.mainLoop(TikaServerCli.java:116) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.server.core.TikaServerCli.execute(TikaServerCli.java:88) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.server.core.TikaServerCli.main(TikaServerCli.java:66) [tika-server-standard-2.2.1.jar:2.2.1]
Caused by: java.lang.RuntimeException: Failed to start forked process -- forked is not alive
at org.apache.tika.server.core.TikaServerWatchDog$ForkedProcess.<init>(TikaServerWatchDog.java:306) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.server.core.TikaServerWatchDog$ForkedProcess.<init>(TikaServerWatchDog.java:269) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.server.core.TikaServerWatchDog.startForkedProcess(TikaServerWatchDog.java:209) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.server.core.TikaServerWatchDog.call(TikaServerWatchDog.java:143) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.server.core.TikaServerWatchDog.call(TikaServerWatchDog.java:53) ~[tika-server-standard-2.2.1.jar:2.2.1]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:833) ~[?:?]
{code}
I frankly don't know what to do, or how to get this thing to accept URLs.

> A replacement for enableFileUrl or Support for Google Cloud
> -----------------------------------------------------------
>
>                 Key: TIKA-3523
>                 URL: https://issues.apache.org/jira/browse/TIKA-3523
>             Project: Tika
>          Issue Type: Wish
>          Components: tika-server
>    Affects Versions: 2.0.0
>            Reporter: Fatih Pazarbasi
>            Priority: Minor
>
> Hello,
> I have a setup where users upload their files to a cloud bucket, and I forward the fileUrl to a serverless cloud instance to run OCR on them. I do it this way so that users never contact the Tika Server directly and I keep a copy of what they sent for processing. They also have nothing to do with the unprocessed response.
> Now that you've removed enableFileUrl, I have to download the files from the cloud bucket to the backend instance and then upload them to the /tika server again.
> I tried the following config.xml to work around the situation, but in vain.
>   For the made-up URL: [https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/somefilethatdoesnotexist.pdf|https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/]
> {code:xml}
> <fetchers> 
>  <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher"> 
>   <params> 
>    <name>fsf</name> 
>    <basePath>https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o</basePath> 
>   </params> 
>  </fetcher> 
> </fetchers> 
> <emitters> 
>  <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter"> 
>   <params> 
>    <name>fse</name> 
>    <basePath>gs://abcd-efgh.appspot.com/users</basePath> 
>   </params> 
>  </emitter> 
> </emitters> 
> <server> 
>  <params> 
>   <enableUnsecureFeatures>true</enableUnsecureFeatures> 
>  </params> 
> </server> 
> <pipes> 
>  <params> 
>   <tikaConfig>/path/to/tika-config.xml</tikaConfig> 
>  </params> 
> </pipes>{code}
> {code:javascript}
> headers: {         
> Accept: 'text/plain',         
> 'User-Agent': 'Firebase Functions',         
> fetcherName: 'fsf',         
> fetchKey: 'somefilethatdoesnotexist.pdf',   
> },{code}
> It doesn't support the gs:// Google Storage bucket either. I have all the necessary permissions, but that didn't help. I'm using a dockerized version of Tika server, so the file system does not seem to be my concern.
>   
>  In the golden times of 1.2x I was simply using:
>   
> {code:javascript}
> headers: {               
> Accept: 'text/plain',               
> 'User-Agent': 'Firebase Functions',               
> fileUrl: 'https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/somefilethatdoesnotexist.pdf',             
> },{code}
>  
>   
>  Am I missing something? If not, my wish is this: could you please make it so that fetcherName is the definitive first part of the old fileUrl and fetchKey is the specific pointer to a file?
> This way I have some control over the URLs that are sent to the Tika server, unlike with enableFileUrl, and I can also have my cake and eat it too, without creating extra traffic on the backend by downloading from the bucket and uploading to Tika.
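
To illustrate the wish above, here is a minimal client-side sketch of splitting an old-style fileUrl into a fetcherName/fetchKey header pair. The helper `toFetchHeaders` and the `FETCHER_BASES` mapping are hypothetical and not part of Tika; whether a fetcher on the server side can resolve such a key against an HTTP base is exactly what is being asked for, not something this sketch confirms.

```javascript
// Hypothetical mapping (not part of Tika) from a fetcher name to the
// URL prefix that fetcher would be allowed to serve.
const FETCHER_BASES = {
  fsf: 'https://firebasestorage.googleapis.com/v0/b/abcd-efgh.appspot.com/o/',
};

// Hypothetical helper: turn a full file URL into request headers,
// accepting only URLs that sit under a configured base.
function toFetchHeaders(fileUrl, bases = FETCHER_BASES) {
  for (const [fetcherName, base] of Object.entries(bases)) {
    if (fileUrl.startsWith(base) && fileUrl.length > base.length) {
      return {
        Accept: 'text/plain',
        'User-Agent': 'Firebase Functions',
        fetcherName,
        // The remainder after the base is the pointer to the specific file.
        fetchKey: fileUrl.slice(base.length),
      };
    }
  }
  throw new Error(`URL is not under any configured fetcher base: ${fileUrl}`);
}
```

Called with the made-up URL from the description, this would yield fetcherName 'fsf' and fetchKey 'somefilethatdoesnotexist.pdf', while a URL outside the configured base is rejected before any request is made.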



--
This message was sent by Atlassian Jira
(v8.20.1#820001)