Posted to user@tika.apache.org by Christian Ribeaud <ch...@karakun.com> on 2022/11/01 15:55:11 UTC
Re: Paragraph words getting merged
Tim,
what exactly do you mean by "Tika appears to add a new line in the correct spot at least for IDEC-102..."?
That is correct, but it is ignorable whitespace, right?
Best,
christian
From: Tim Allison <ta...@apache.org>
Date: Monday, 31 October 2022 at 16:22
To: user@tika.apache.org <us...@tika.apache.org>
Subject: Re: Paragraph words getting merged
Y, I agree with Nick. Tika appears to add a new line in the correct spot at least for IDEC-102...
On Mon, Oct 31, 2022 at 9:22 AM Nick Burch <ni...@apache.org> wrote:
On Sun, 30 Oct 2022, Christian Ribeaud wrote:
> I am using the default configuration. I think, we could reduce my
> problem to following code snippet:
Is there a reason that you aren't using one of the built-in Tika content
handlers? Generally they should be taking care of everything for you with
paragraphs, plain text vs html etc
Nick
Re: Paragraph words getting merged
Posted by Christian Ribeaud <ch...@karakun.com>.
Hi Tim,
Thank you so much for clarifying that part for me. THAT is really useful.
Kindest regards,
christian
From: Tim Allison <ta...@apache.org>
Date: Tuesday, 1 November 2022 at 17:09
To: user@tika.apache.org <us...@tika.apache.org>
Subject: Re: Paragraph words getting merged
Sorry, it took a while to make time to look at this in detail. Yes, Tika adds "ignorable whitespace". Specifically, in the case mentioned, the PDFParser writes a line separator, which in turn has our XHTMLContentHandler call ignorableWhitespace:
    @Override
    protected void writeLineSeparator() throws IOException {
        try {
            xhtml.newline();
        } catch (SAXException e) {
            throw new IOException("Unable to write a newline character", e);
        }
    }

    public void newline() throws SAXException {
        ignorableWhitespace(NL, 0, NL.length);
    }
Re: Paragraph words getting merged
Posted by Tim Allison <ta...@apache.org>.
Sorry, it took a while to make time to look at this in detail. Yes, Tika adds
"ignorable whitespace". Specifically, in the case mentioned, the PDFParser
writes a line separator, which in turn has our XHTMLContentHandler call
ignorableWhitespace:
    @Override
    protected void writeLineSeparator() throws IOException {
        try {
            xhtml.newline();
        } catch (SAXException e) {
            throw new IOException("Unable to write a newline character", e);
        }
    }

    public void newline() throws SAXException {
        ignorableWhitespace(NL, 0, NL.length);
    }
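
To make the effect on a custom handler concrete, here is a minimal sketch using only the JDK's SAX classes, not Tika itself. It assumes the custom handler collects text in characters() and does nothing in ignorableWhitespace(), which may or may not match the original snippet. The class names, the extract() helper, and the sample lines are hypothetical, and the single '\n' is only a stand-in for the NL array above:

```java
import org.xml.sax.helpers.DefaultHandler;

class IgnorableWhitespaceDemo {

    /** Collects characters() only; ignorableWhitespace() is an explicit no-op. */
    static class CharactersOnlyHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Dropped on purpose: this is where the line break gets lost.
        }
    }

    /** Same collector, but keeps whatever arrives as ignorable whitespace. */
    static class WhitespaceAwareHandler extends CharactersOnlyHandler {
        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /**
     * Hypothetical stand-in for a parser: sends each line through characters()
     * and then a newline through ignorableWhitespace(), mirroring newline() above.
     */
    static String extract(CharactersOnlyHandler handler, String... lines) {
        char[] nl = {'\n'}; // assumption: stands in for the NL array
        for (String line : lines) {
            char[] chars = line.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.ignorableWhitespace(nl, 0, nl.length);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        // Words at the line boundary run together: "first linesecond line"
        System.out.println(extract(new CharactersOnlyHandler(), "first line", "second line"));
        // Line breaks are kept when ignorableWhitespace() is handled as well
        System.out.println(extract(new WhitespaceAwareHandler(), "first line", "second line"));
    }
}
```

If a custom handler is really needed, forwarding ignorableWhitespace() to the same buffer as characters() is one way to keep the line breaks; the built-in handlers Nick mentioned are the simpler route.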