Posted to issues@nifi.apache.org by "Alex Ethier (Jira)" <ji...@apache.org> on 2024/02/14 19:27:00 UTC

[jira] [Created] (NIFI-12791) ParseDocument PDF - Missing Pillow dependency

Alex Ethier created NIFI-12791:
----------------------------------

             Summary: ParseDocument PDF - Missing Pillow dependency
                 Key: NIFI-12791
                 URL: https://issues.apache.org/jira/browse/NIFI-12791
             Project: Apache NiFi
          Issue Type: Bug
          Components: Extensions
    Affects Versions: 2.0.0-M2
            Reporter: Alex Ethier
            Assignee: Alex Ethier


The custom Python processor ParseDocument, when configured to parse PDFs, throws the exception below because a required module is not installed.

The error message `ModuleNotFoundError: No module named 'pillow_heif'` indicates that the latest version of the 'unstructured' dependency now requires 'pillow_heif' to be installed.

Full Stacktrace:
{code:java}
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
  File "/opt/nifi-2.0.0-SNAPSHOT/python/framework/py4j/java_gateway.py", line 2466, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/opt/nifi-2.0.0-SNAPSHOT/python/api/nifiapi/flowfiletransform.py", line 33, in transformFlowFile
    return self.transform(self.process_context, flowfile)
  File "/opt/nifi-2.0.0-SNAPSHOT/./python/extensions/ParseDocument.py", line 257, in transform
    documents = self.create_docs(context, flowFile)
  File "/opt/nifi-2.0.0-SNAPSHOT/./python/extensions/ParseDocument.py", line 225, in create_docs
    documents = loader.load()
  File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/langchain_community/document_loaders/unstructured.py", line 87, in load
    elements = self._get_elements()
  File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/langchain_community/document_loaders/pdf.py", line 57, in _get_elements
    from unstructured.partition.pdf import partition_pdf
  File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/unstructured/partition/pdf.py", line 38, in <module>
    from pillow_heif import register_heif_opener
ModuleNotFoundError: No module named 'pillow_heif'


	at py4j.Protocol.getReturnValue(Protocol.java:476)
	at org.apache.nifi.py4j.client.PythonProxyInvocationHandler.invoke(PythonProxyInvocationHandler.java:64)
	at org.apache.nifi.py4j.client.NiFiPythonGateway$1.invoke(NiFiPythonGateway.java:148)
	at jdk.proxy29/jdk.proxy29.$Proxy179.transformFlowFile(Unknown Source)
	at org.apache.nifi.python.processor.FlowFileTransformProxy.onTrigger(FlowFileTransformProxy.java:66)
	at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
	at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1274)
	at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:244)
	at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:102)
	at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583) {code}
Including 'pillow-heif' in the list of required dependencies for ParseDocument fixes the issue (PR forthcoming).
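
To confirm the diagnosis before or after applying the fix, one can check whether the module is importable in the processor's Python environment. The following is only a minimal sketch and is not part of NiFi or of the forthcoming PR: the helper name `missing_modules` is hypothetical, and it relies solely on the standard library. Note that the pip package is named 'pillow-heif' while the module imported by unstructured is 'pillow_heif'.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import.

    Hypothetical helper for illustration; it only inspects the
    environment of the interpreter that runs it.
    """
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # 'pillow_heif' is the import name from the stack trace above.
    print(missing_modules(["pillow_heif"]))
```

Run with the same interpreter and environment the processor uses, an empty list would suggest that the module is available to it.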

Another possible fix is pinning the dependency version numbers so that new upstream releases cannot introduce breaking changes.
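
For the version-locking approach, a small check can flag requirement strings that are not fixed to an exact version. This is a rough sketch under stated assumptions: the helper name `unpinned` and the package names in the example are hypothetical, and the pattern only recognizes the simple `name==version` form, not the full requirement-specifier grammar.

```python
import re

# Matches an exact pin such as "some-package==1.2.3".
# Wildcard pins like "==1.*" are deliberately treated as unpinned.
_PINNED = re.compile(r"^[A-Za-z0-9._\-\[\],]+==[^=*\s]+$")


def unpinned(requirements):
    """Return the requirement strings that are not pinned to one exact version."""
    return [req for req in requirements if not _PINNED.match(req.strip())]


if __name__ == "__main__":
    # Hypothetical package names and versions, for illustration only.
    print(unpinned(["pkg-a==1.2.3", "pkg-b", "pkg-c>=2.0"]))
```

Pinning trades automatic upstream fixes for reproducibility, so the pinned versions would need to be updated deliberately over time.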



--
This message was sent by Atlassian Jira
(v8.20.10#820010)