
Posted to user@tika.apache.org by Christian Ribeaud <ch...@karakun.com> on 2022/10/30 16:00:48 UTC

Paragraph words getting merged

Dear all,

I would like some insight into how Tika works. We are using Tika v1.28.5 and PDFBox v2.0.27.
Attached you will find the extracted text, both as plain text and as HTML.

I am using the default configuration. I think we can reduce my problem to the following code snippet:

[code]
// PageContentHandler extends DefaultHandler

@Override
public void ignorableWhitespace(char[] ch, int start, int length) {
    // We ignore white spaces
}

@Override
public void characters(char[] ch, int start, int length) {
    // Append only the slice [start, start + length): the SAX buffer may hold other data
    if (length > 0) {
        builder.append(ch, start, length);
    }
}
[/code]

For each book page (as PDF), I build its content using Tika before submitting it to an OpenSearch instance.
For a reason I do not understand, terms get merged where they should NOT:

[code]
// Extracted.html

<i>IDEC-102
</i></p>
<p>Rivaroxaban. Ci9Hi8ClN3O5S. 5-Chloro-N-({(5S)-2-oxo-3-
[/code]

IDEC-102 and Rivaroxaban belong to different paragraphs in extracted.html. However, my logic glues
them together and we get a word like IDEC-102Rivaroxaban, which does not make any sense in this context.

One possibility to fix the problem would be:

[code]
@Override
public void ignorableWhitespace(char[] ch, int start, int length) {
    if (length > 0) {
        builder.append(ch, start, length);
    }
}
[/code]

Does this change make sense? As far as I understand, ignorable whitespace should be, well, ignorable, no?
Did I misunderstand something, or did I do something wrong?

Any feedback is very welcome. Best regards,

christian

Re: Paragraph words getting merged

Posted by Christian Ribeaud <ch...@karakun.com>.

Hi Tim,

This is what I am actually doing: I’m parsing the XHTML and collecting the pages by identifying the corresponding DIVs. This works really nicely. And, actually, the whole engine is working nicely.
I’m just having this very specific and small problem I described in my original email, a problem I would like to understand better before delivering a suitable fix.

I think we should focus on the original problem. Now you have the context… 😉

This is what I understand; please correct me if I am wrong. In a normal text flow, we would expect a text section/paragraph to end with a period or something similar.
In the page I posted in my original message, there is no ending period:

[code]
// Extracted.html

<i>IDEC-102
</i></p>
<p>Rivaroxaban. Ci9Hi8ClN3O5S. 5-Chloro-N-({(5S)-2-oxo-3-
[/code]

And, because I am appending the text delivered by Tika, in the example above, IDEC-102 gets merged with Rivaroxaban. As I said, one possible way to get rid of the problem would be to use (in my custom content handler):

[code]
@Override
public void ignorableWhitespace(char[] ch, int start, int length) {
    if (length > 0) {
        builder.append(ch, start, length);
    }
}
[/code]

Instead of:

[code]
@Override
public void ignorableWhitespace(char[] ch, int start, int length) {
     // We ignore white spaces
}
[/code]

But this does not feel very natural to me. When switching to a new section/paragraph, I would expect Tika to give me a new line or a space, but NOT as ignorable whitespace. Usually, within a given section/paragraph, I get a trailing space after each sentence, right?

Is my problem now clearer?

Thanks a lot for your time and your patience,

christian

From: Tim Allison <ta...@apache.org>
Date: Monday, 31 October 2022 at 20:09
To: user@tika.apache.org <us...@tika.apache.org>
Subject: Re: Paragraph words getting merged
We add <div class="page">.*</div> markers in our xhtml.  Would that meet your needs?  Parse the xhtml and send to Elastic?  Or are you looking to send data per page directly to Elasticsearch during the parse?

On Mon, Oct 31, 2022 at 1:43 PM Christian Ribeaud <ch...@karakun.com> wrote:
Hi Tim,

Sorry for not being clear enough. I want to index each page as a separate document in OpenSearch (aka ex-Elasticsearch).
The page count and language are relevant for the book metadata only (which gets stored in a DynamoDB table).

Cheers,

christian

From: Tim Allison <ta...@apache.org>
Date: Monday, 31 October 2022 at 18:37
To: user@tika.apache.org <us...@tika.apache.org>
Subject: Re: Paragraph words getting merged
I'm sorry.  I'm missing the context.  What are you trying to accomplish?  Do you want to index each page as a separate document in Elasticsearch?  Or, is the langid + pagecount critical for your needs and somehow you need to create your own handler for those?

On Mon, Oct 31, 2022 at 1:16 PM Christian Ribeaud <ch...@karakun.com> wrote:
Good evening,

Thanks for the prompt answer. AFAIR (the project is old, but the problem is new), I needed a mechanism to process pages in batches.

The software handles huge books (I think the biggest ones are around half a GB) as a Lambda in AWS.
Due to the memory and CPU limitations of AWS Lambda, this is the way we decided to go.

Does Tika already offer such a content handler? If not, which strategy would you suggest?

We have our custom TikaPageContentHandler, which is plugged in as follows:

[code]
public void extractTextAndUploadToElasticsearch(long maxMainMemoryBytes, InputStream stream, long fileSize, String fileName) throws TikaException, IOException, SAXException {
    String baseName = FilenameUtils.getBaseName(fileName);
    final int bulkSize = getElasticBulkSize();
    URL tikaConfigUrl = TikaLambda.class.getResource("/config/tika-config.xml");
    assert tikaConfigUrl != null : "Unspecified Tika configuration";
    TikaPageContentHandler tikaPageContentHandler = new TikaPageContentHandler(elasticsearchClient, baseName,
            bulkSize);
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    PDFParserConfig pdfParserConfig = new PDFParserConfig();
    pdfParserConfig.setMaxMainMemoryBytes(maxMainMemoryBytes);
    LogUtils.info(LOG, () -> String.format("Using following PDF parser configuration '%s'.",
            ToStringBuilder.reflectionToString(pdfParserConfig, ToStringStyle.MULTI_LINE_STYLE)));
    // Overrides the default values specified in 'tika-config.xml'
    parseContext.set(PDFParserConfig.class, pdfParserConfig);
    TikaConfig tikaConfig = new TikaConfig(tikaConfigUrl);
    // Auto-detecting parser. So, we theoretically are able to handle any document.
    AutoDetectParser parser = new AutoDetectParser(tikaConfig);
    parseContext.set(Parser.class, parser);
    parser.parse(stream, tikaPageContentHandler, metadata, parseContext);
    int pageCount = tikaPageContentHandler.getPageCount();
    LogUtils.info(LOG, () -> String.format("%d/%d page(s) of document identified by ID '%s' have been submitted.",
            tikaPageContentHandler.getSubmittedPageCount(), pageCount, baseName));
    LanguageResult languageResult = tikaPageContentHandler.getLanguageResult();
    String language = languageResult.isReasonablyCertain() ? languageResult.getLanguage() : null;
    // Put an entry into DynamoDb.
    putItem(baseName, fileSize, pageCount, language);
}
[/code]

Christian

From: Tim Allison <ta...@apache.org>
Date: Monday, 31 October 2022 at 16:22
To: user@tika.apache.org <us...@tika.apache.org>
Subject: Re: Paragraph words getting merged
Y, I agree with Nick. Tika appears to add a new line in the correct spot at least for IDEC-102...

On Mon, Oct 31, 2022 at 9:22 AM Nick Burch <ni...@apache.org> wrote:
On Sun, 30 Oct 2022, Christian Ribeaud wrote:
> I am using the default configuration. I think, we could reduce my
> problem to following code snippet:

Is there a reason that you aren't using one of the built-in Tika content
handlers? Generally they should be taking care of everything for you with
paragraphs, plain text vs html etc

Nick

Re: Paragraph words getting merged

Posted by Tim Allison <ta...@apache.org>.

We add <div class="page">.*</div> markers in our xhtml.  Would that meet
your needs?  Parse the xhtml and send to Elastic?  Or are you looking to
send data per page directly to Elasticsearch during the parse?
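
For illustration, here is a minimal sketch of how per-page text could be collected from such markers. It is not Tika code: the class name PageSplittingHandler and the helper split() are hypothetical, and the JDK's own SAX parser reading a string stands in for the XHTML events that Tika would deliver. It assumes each page is wrapped in <div class="page"> and that the element name is available as the qName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <div class="page"> element.
class PageSplittingHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    // Element depth inside the current page div; 0 means "outside any page".
    private int depth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            depth++; // nested element inside a page
        } else if ("div".equals(qName) && "page".equals(atts.getValue("class"))) {
            depth = 1; // a new page starts here
            current.setLength(0);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The page ends when the element that opened it is closed.
        if (depth > 0 && --depth == 0) {
            pages.add(current.toString().trim());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Keep ignorable whitespace so that adjacent blocks do not get glued together.
        if (depth > 0) {
            current.append(ch, start, length);
        }
    }

    List<String> pages() {
        return pages;
    }

    // Convenience helper for the demo: parses an XHTML string with the JDK SAX parser.
    static List<String> split(String xhtml) {
        try {
            PageSplittingHandler handler = new PageSplittingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.pages();
        } catch (Exception e) {
            throw new IllegalStateException("Unable to parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body>"
                + "<div class=\"page\"><p>First page</p></div>"
                + "<div class=\"page\"><p>Second page</p></div>"
                + "</body></html>";
        for (String page : split(xhtml)) {
            System.out.println(page);
        }
    }
}
```

In a real pipeline, each completed page would presumably be handed to the indexing client in endElement instead of being kept in a list, which would keep memory usage bounded for large books.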

Re: Paragraph words getting merged

Posted by Christian Ribeaud <ch...@karakun.com>.

Hi Tim,

Sorry for not being clear enough. I want to index each page as a separate document in OpenSearch (aka ex-Elasticsearch).
The page count and language are relevant for the book metadata only (which gets stored in a DynamoDB table).

Cheers,

christian

Re: Paragraph words getting merged

Posted by Tim Allison <ta...@apache.org>.

I'm sorry.  I'm missing the context.  What are you trying to accomplish?
Do you want to index each page as a separate document in Elasticsearch?
Or, is the langid + pagecount critical for your needs and somehow you need
to create your own handler for those?

Re: Paragraph words getting merged

Posted by Christian Ribeaud <ch...@karakun.com>.

Good evening,

Thanks for the prompt answer. AFAIR (the project is old, but the problem is new), I needed a mechanism to process pages in batches.

The software handles huge books (I think the biggest ones are around half a GB) as a Lambda in AWS.
Due to the memory and CPU limitations of AWS Lambda, this is the way we decided to go.

Does Tika already offer such a content handler? If not, which strategy would you suggest?

We have our custom TikaPageContentHandler, which is plugged in as follows:

[code]
public void extractTextAndUploadToElasticsearch(long maxMainMemoryBytes, InputStream stream, long fileSize, String fileName) throws TikaException, IOException, SAXException {
    String baseName = FilenameUtils.getBaseName(fileName);
    final int bulkSize = getElasticBulkSize();
    URL tikaConfigUrl = TikaLambda.class.getResource("/config/tika-config.xml");
    assert tikaConfigUrl != null : "Unspecified Tika configuration";
    TikaPageContentHandler tikaPageContentHandler = new TikaPageContentHandler(elasticsearchClient, baseName,
            bulkSize);
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    PDFParserConfig pdfParserConfig = new PDFParserConfig();
    pdfParserConfig.setMaxMainMemoryBytes(maxMainMemoryBytes);
    LogUtils.info(LOG, () -> String.format("Using following PDF parser configuration '%s'.",
            ToStringBuilder.reflectionToString(pdfParserConfig, ToStringStyle.MULTI_LINE_STYLE)));
    // Overrides the default values specified in 'tika-config.xml'
    parseContext.set(PDFParserConfig.class, pdfParserConfig);
    TikaConfig tikaConfig = new TikaConfig(tikaConfigUrl);
    // Auto-detecting parser. So, we theoretically are able to handle any document.
    AutoDetectParser parser = new AutoDetectParser(tikaConfig);
    parseContext.set(Parser.class, parser);
    parser.parse(stream, tikaPageContentHandler, metadata, parseContext);
    int pageCount = tikaPageContentHandler.getPageCount();
    LogUtils.info(LOG, () -> String.format("%d/%d page(s) of document identified by ID '%s' have been submitted.",
            tikaPageContentHandler.getSubmittedPageCount(), pageCount, baseName));
    LanguageResult languageResult = tikaPageContentHandler.getLanguageResult();
    String language = languageResult.isReasonablyCertain() ? languageResult.getLanguage() : null;
    // Put an entry into DynamoDb.
    putItem(baseName, fileSize, pageCount, language);
}
[/code]

Christian

Re: Paragraph words getting merged

Posted by Christian Ribeaud <ch...@karakun.com>.

Hi Tim,

Thank you so much for clarifying that part for me. THAT is really useful.

Kindest regards,

christian

From: Tim Allison <ta...@apache.org>
Date: Tuesday, 1 November 2022 at 17:09
To: user@tika.apache.org <us...@tika.apache.org>
Subject: Re: Paragraph words getting merged
Sorry. Took a while to make time to look in detail.  Yes, Tika adds "ignorable whitespace".  Specifically in the case mentioned, the PDFParser writes a line separator, which has our XHTMLContentHandler in turn call ignorableWhitespace:

@Override
protected void writeLineSeparator() throws IOException {
    try {
        xhtml.newline();
    } catch (SAXException e) {
        throw new IOException("Unable to write a newline character", e);
    }
}

public void newline() throws SAXException {
    ignorableWhitespace(NL, 0, NL.length);
}

On Tue, Nov 1, 2022 at 11:55 AM Christian Ribeaud <ch...@karakun.com> wrote:
Tim,

What exactly do you mean by "Tika appears to add a new line in the correct spot at least for IDEC-102..."?
That is correct, but it is emitted as ignorable whitespace, right?

Best,

christian

Re: Paragraph words getting merged

Posted by Tim Allison <ta...@apache.org>.

Sorry. Took a while to make time to look in detail.  Yes, Tika adds
"ignorable whitespace".  Specifically in the case mentioned, the PDFParser
writes a line separator, which has our XHTMLContentHandler in turn call
ignorableWhitespace:

@Override
protected void writeLineSeparator() throws IOException {
    try {
        xhtml.newline();
    } catch (SAXException e) {
        throw new IOException("Unable to write a newline character", e);
    }
}

public void newline() throws SAXException {
    ignorableWhitespace(NL, 0, NL.length);
}
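
Given that behaviour, a custom handler that accumulates plain text probably has to treat ignorable whitespace as a word separator instead of dropping it. The following is only a sketch under that assumption: the class name WhitespaceAwareHandler is hypothetical, and the SAX events are simulated by hand in main rather than produced by Tika.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical text collector: ignorable whitespace becomes a single separating space.
class WhitespaceAwareHandler extends DefaultHandler {
    private final StringBuilder builder = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append only the slice described by start and length.
        builder.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Insert one space, unless the text is empty or already ends with whitespace.
        if (length > 0 && builder.length() > 0
                && !Character.isWhitespace(builder.charAt(builder.length() - 1))) {
            builder.append(' ');
        }
    }

    String text() {
        return builder.toString();
    }

    public static void main(String[] args) {
        WhitespaceAwareHandler handler = new WhitespaceAwareHandler();
        char[] first = "IDEC-102".toCharArray();
        char[] newline = {'\n'};
        char[] second = "Rivaroxaban.".toCharArray();

        // Simulated event sequence: text, line separator as ignorable whitespace, text.
        handler.characters(first, 0, first.length);
        handler.ignorableWhitespace(newline, 0, newline.length);
        handler.characters(second, 0, second.length);

        System.out.println(handler.text()); // prints "IDEC-102 Rivaroxaban."
    }
}
```

Collapsing the whitespace to a single space is a design choice for search indexing; appending the original characters instead would preserve the line breaks.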




Re: Paragraph words getting merged

Posted by Christian Ribeaud <ch...@karakun.com>.

Tim,

What exactly do you mean by "Tika appears to add a new line in the correct spot at least for IDEC-102..."?
That is correct, but it is emitted as ignorable whitespace, right?

Best,

christian

Re: Paragraph words getting merged

Posted by Tim Allison <ta...@apache.org>.

Y, I agree with Nick. Tika appears to add a new line in the correct spot at
least for IDEC-102...

Re: Paragraph words getting merged

Posted by Nick Burch <ni...@apache.org>.

On Sun, 30 Oct 2022, Christian Ribeaud wrote:
> I am using the default configuration. I think, we could reduce my 
> problem to following code snippet:

Is there a reason that you aren't using one of the built-in Tika content 
handlers? Generally they should be taking care of everything for you with 
paragraphs, plain text vs html etc

Nick