Posted to dev@tika.apache.org by "Rida Benjelloun (JIRA)" <ji...@apache.org> on 2008/01/14 18:44:34 UTC

[jira] Created: (TIKA-114) PDFParser : Getting content of the document using "writer.ToString ()" , some words are stuck together

PDFParser : Getting content of the document using "writer.ToString ()" , some words are stuck together
------------------------------------------------------------------------------------------------------

                 Key: TIKA-114
                 URL: https://issues.apache.org/jira/browse/TIKA-114
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.2-incubating
            Reporter: Rida Benjelloun
             Fix For: 0.2-incubating


When the output of PDFParser is read back using "writer.toString()", some words in the extracted text are stuck together, with no space or line break between them.
Result of PDF extraction:
"Apache Tika - Apache Tikahttp://incubator.apache.org/tika/1 of 115.9.2007 11:02Tika - Content Analysis ToolkitApache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Apache Tika is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Lucene PMC. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.See the Apache Tika Incubation Status page for the current incubation status.Latest NewsMarch 22nd, 2007: Apache Tika project startedThe Apache Tika project was formally started when the Tika proposal was accepted by the Apache Incubator PMC."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-114) PDFParser : Getting content of the document using "writer.ToString ()" , some words are stuck together

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628859#action_12628859 ] 

Dave Meikle commented on TIKA-114:
----------------------------------

OK, processLineSeparator and processWordSeparator are not available in PDFBox-0.7.3, which is what we have as our dependency. They are, however, available on SVN HEAD of the PDFBox Incubator project, so if you build and use that, it works fine. I noticed that a lot of people are using either dev builds or their own compiled versions.

I see that they are looking to do a first release under the new Apache Incubator project, but need to resolve PDFBOX-366 (https://issues.apache.org/jira/browse/PDFBOX-366). Jukka, do you know the status of this?

If we want to release Tika 0.2-incubating before the first PDFBox release, there is a workaround. I don't particularly like it myself, but it would solve the problem when using PDFBox-0.7.3. I will attach it as a patch.



[jira] Updated: (TIKA-114) PDFParser : Getting content of the document using "writer.ToString ()" , some words are stuck together

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-114:
-----------------------------

    Attachment: TIKA-114.diff

Possible workaround for PDFBox-0.7.3 - still don't like it though :-)


[jira] Resolved: (TIKA-114) PDFParser : Getting content of the document using "writer.ToString ()" , some words are stuck together

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-114.
--------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Patch applied in revision 695266. Thanks!

I created issue TIKA-158 for upgrading to the next PDFBox release once it becomes available. Until then, I guess we can live with this workaround.


[jira] Commented: (TIKA-114) PDFParser : Getting content of the document using "writer.ToString ()" , some words are stuck together

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560803#action_12560803 ] 

Jukka Zitting commented on TIKA-114:
------------------------------------

PDFBox doesn't seem to call the processLineSeparator method in our PDF2XHTML class. I'll investigate...
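
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It assumes nothing about the real PDFBox or Tika code: the Stripper and CollectingStripper classes, the callsSeparatorHook flag and the extract helper are all hypothetical stand-ins. The point is only that a subclass which relies on an overridden separator hook for its line breaks produces run-together text whenever the base class never invokes that hook.

```java
import java.util.Arrays;
import java.util.List;

public class SeparatorHookDemo {

    /** Hypothetical stand-in for a text stripper base class; NOT the PDFBox API. */
    static class Stripper {
        private final boolean callsSeparatorHook;

        Stripper(boolean callsSeparatorHook) {
            this.callsSeparatorHook = callsSeparatorHook;
        }

        /** Reports each line of text, and optionally each line break, through the hooks. */
        final void strip(List<String> lines) {
            for (String line : lines) {
                processText(line);
                if (callsSeparatorHook) {
                    processLineSeparator();
                }
            }
        }

        /** Hook for text content; subclasses override it. */
        protected void processText(String text) { }

        /** Hook for line breaks; subclasses override it. */
        protected void processLineSeparator() { }
    }

    /** Hypothetical subclass that depends on the separator hook for its line breaks. */
    static class CollectingStripper extends Stripper {
        private final StringBuilder out = new StringBuilder();

        CollectingStripper(boolean callsSeparatorHook) {
            super(callsSeparatorHook);
        }

        @Override
        protected void processText(String text) {
            out.append(text);
        }

        @Override
        protected void processLineSeparator() {
            out.append('\n');
        }

        String result() {
            return out.toString();
        }
    }

    /** Runs the sketch on two lines taken from the extraction result in the report. */
    static String extract(boolean callsSeparatorHook) {
        CollectingStripper stripper = new CollectingStripper(callsSeparatorHook);
        stripper.strip(Arrays.asList(
                "Latest News",
                "March 22nd, 2007: Apache Tika project started"));
        return stripper.result();
    }

    public static void main(String[] args) {
        // Hook never called: the two lines are joined as "Latest NewsMarch 22nd, ..."
        System.out.println(extract(false));
        // Hook called: each line ends with a newline
        System.out.println(extract(true));
    }
}
```

With the hook disabled, the output reproduces the "Latest NewsMarch 22nd" pattern seen in the reported extraction result, which is consistent with the separator callback not being reached.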


[jira] Commented: (TIKA-114) PDFParser : Getting content of the document using "writer.ToString ()" , some words are stuck together

Posted by "Dave Meikle (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628853#action_12628853 ] 

Dave Meikle commented on TIKA-114:
----------------------------------

It looks like this has also been raised as a tracker item against PDFBox itself:
http://sourceforge.net/tracker/index.php?func=detail&aid=1917403&group_id=78314&atid=552832

I am going to take a look at the PDFBox code.
