Posted to users@camel.apache.org by furchess123 <co...@hotmail.com> on 2015/10/29 14:12:50 UTC

correct way to provide regex in TokenizerExpression?

What is the correct way to supply the regular expression in
TokenizerExpression? 

Per Claus's advice, I have tried the following to tokenize a file by lines
while grouping lines, using a regex to support more than one type of line
separator:

        TokenizerExpression tokenizerExpression = new TokenizerExpression();
        tokenizerExpression.setToken("\n|\r\n|\r");  // tokenize by line separators (system-agnostic)
        tokenizerExpression.setGroup(500);           // group this many lines into one exchange
        tokenizerExpression.setRegex(true);

        ...
        split(tokenizerExpression)...

The file gets split into lines that are grouped by 500, except that every
other line in each group is not an actual line from the file, but *a line
consisting of the single character '|'*.  The regex seems correct, but the
Camel tokenizer misinterprets it and adds a bogus line for every '|', which
is part of the regex language.

I have tried various ways to write a regex, but the tokenizer always seems
to not parse it correctly and adds lines to the exchange that contain
nothing but regex language characters.

How do I provide a regex so that the tokenizer interprets it properly?
Specifically, how do I make it work with the regex I am trying to use above?

Thanks!



--
View this message in context: http://camel.465427.n5.nabble.com/correct-way-to-provide-regex-in-TokenizerExpression-tp5773192.html
Sent from the Camel - Users mailing list archive at Nabble.com.

Re: correct way to provide regex in TokenizerExpression?

Posted by SWI <Se...@aixigo.de>.
Hi,

I totally agree with furchess's post, and I guess issue
https://issues.apache.org/jira/browse/CAMEL-9241 is related to this topic.
Having the regex literal as the delimiter in the grouped result seems broken.

Currently we replace the "regex literal" after the tokenize has taken place,
but it seems like a bad idea to have to anticipate the "matching" delimiter.

Is there a way to upvote this issue?

Regards,

SWI





Re: correct way to provide regex in TokenizerExpression?

Posted by furchess123 <co...@hotmail.com>.
Hi Claus,
thank you for responding. The problem we are seeing currently is that, if we
provide a regex to the tokenizer to detect token delimiters, the tokenizer
inserts that expression literal into the payload itself - while replacing
the actual delimiters matched by the regex. I think you will agree that
modifying the original payload in any way other than splitting it into
chunks is not a desirable behavior.

I think the most natural and logical way would be to correct the existing
tokenizer functionality to:

   1) Correctly identify the individual tokens by matching the delimiters
using the provided regular expression (as is done today, indeed);
   2) Ensure that the resulting exchange message body (a group of N tokens)
retains the original token separators (rather than having them replaced by
the regex literal).

Also, for what it's worth, perhaps it would be helpful to slightly change
the terminology in the API documentation. What is currently described as the
"token" argument (or "token expression") to the tokenize() method is
actually the "token /delimiter/ expression" - the expression that matches
the delimiters that separate the tokens in the payload. So, in the case of a
file being split into lines or groups of lines, a token represents a line,
obviously, not the separator/delimiter. ;)

 




Re: correct way to provide regex in TokenizerExpression?

Posted by Claus Ibsen <cl...@gmail.com>.
So you want to split a file line by line and disregard what kind of
line terminators the file is using.

Camel uses java.util.Scanner with the provided token to split the
content. So if you can get that working with Scanner directly, it should
be supported.

As it may be a bit difficult to do this, maybe we need a DSL syntax to
offer an expression that can split this nicely, where you can choose the
line terminators as: platform, windows, unix, or both. You could then set
it to both in your use case.
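For what it's worth, the java.util.Scanner behavior described above can be
checked outside Camel. This is a standalone sketch (class name and sample
input are made up for illustration) showing Scanner splitting on all three
terminator styles via useDelimiter:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerLineSplit {

    // Split a payload on any of the three common line terminators using
    // Scanner's regex delimiter support. "\r\n" is listed first so a Windows
    // terminator is consumed as one delimiter rather than two.
    static List<String> splitLines(String payload) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(payload)) {
            scanner.useDelimiter("\r\n|\n|\r");
            while (scanner.hasNext()) {
                lines.add(scanner.next());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Mixed terminators: Windows (\r\n), Unix (\n), old Mac (\r)
        List<String> lines = splitLines("a\r\nb\nc\rd");
        System.out.println(lines); // [a, b, c, d]
    }
}
```

Since Scanner handles the regex cleanly here, the separator replacement seen
in the grouped exchange body would appear to come from the grouping step, not
from Scanner itself.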





On Thu, Oct 29, 2015 at 11:38 PM, furchess123 <co...@hotmail.com> wrote:
> Ok, here's the workaround I have implemented to go past the above issue...
>
> [snip - the full workaround is quoted in furchess123's post below]



-- 
Claus Ibsen
-----------------
http://davsclaus.com @davsclaus
Camel in Action 2nd edition:
https://www.manning.com/books/camel-in-action-second-edition

Re: correct way to provide regex in TokenizerExpression?

Posted by furchess123 <co...@hotmail.com>.
Ok, here's the workaround I have implemented to go past the above issue...

Some MyConstants.java file:

    public static final String SYSTEM_AGNOSTIC_NEWLINE_REGEX = "\r|\r\n|\n";

Splitter route configuration in a RouteBuilder implementation:

        TokenizerExpression tokenizerExpression = new TokenizerExpression();
        tokenizerExpression.setToken(MyConstants.SYSTEM_AGNOSTIC_NEWLINE_REGEX);  // tokenize by line separators
        tokenizerExpression.setGroup(readerConfig.getLinesPerChunk());  // group this many lines into one exchange
        tokenizerExpression.setRegex(true);  // regular expression, not a simple string match

        from(FILE_SPLITTER_ENDPOINT).routeId("fileSplitterRoute").
            split(tokenizerExpression).
                streaming().          // enable streaming vs. reading all into memory
                parallelProcessing(readerConfig.isParallelProcessing()).  // on/off concurrent processing of multiple chunks
                stopOnException().    // stop processing file if a system exception occurs (handled by onException clause)
                bean(new TokenizerCharRemover()).  // cleans junk chars inserted by Camel's tokenizer due to bug(?)
                unmarshal().csv().    // unmarshal each chunk to Java (list of String lists) using Camel's CSV component
                bean(csvHandler).     // hand each unmarshalled list of lines/fields to a bean that parses and validates line content
                bean(importProcessor).// process codes for import (depending on operational mode and errors in exchange)
                to(AGGREGATE_ERRORS_ENDPOINT).  // delegate to nested route to update error report
            end();

TokenizerCharRemover.java:

public class TokenizerCharRemover
{
    /**
     * Pre-compiled pattern that matches the instances of the regular-expression
     * character sequence inserted by Camel's splitter's tokenizer between the
     * file lines in the body of the exchange.  The input string that specifies
     * the pattern is treated as a sequence of literal characters thanks to the
     * {@link Pattern#LITERAL} flag.
     */
    private static final Pattern REPLACE_JUNK_PATTERN =
        Pattern.compile(MyConstants.SYSTEM_AGNOSTIC_NEWLINE_REGEX, Pattern.LITERAL);


    /**
     * Replaces every instance of the {@link MyConstants#SYSTEM_AGNOSTIC_NEWLINE_REGEX}
     * character sequence in the exchange body with a simple '\n' line separator.
     */
    @SuppressWarnings("MethodMayBeStatic")
    @Handler
    public void cleanupLineSeparators(Exchange exchange)
    {
        String newBody = REPLACE_JUNK_PATTERN.matcher(exchange.getIn().getBody(String.class))
            .replaceAll(Matcher.quoteReplacement("\n"));
        exchange.getIn().setBody(newBody);
    }

}

If there is a better solution, or if I have missed some obvious simple way
to use the tokenizer that does not replace the matching line separators with
the regex character sequence itself, please let me know! I'd very much
appreciate that.
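The Pattern.LITERAL replacement at the heart of the workaround can be
exercised on its own. This is a standalone sketch of the same technique (the
class name and sample body string are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralCleanupDemo {

    private static final String NEWLINE_REGEX = "\r|\r\n|\n";

    // Pattern.LITERAL makes the regex string match itself character for
    // character, so '|' and '\' are not treated as regex metacharacters.
    private static final Pattern JUNK =
        Pattern.compile(NEWLINE_REGEX, Pattern.LITERAL);

    // Replace each literal occurrence of the regex string with a plain '\n'.
    static String clean(String body) {
        return JUNK.matcher(body).replaceAll(Matcher.quoteReplacement("\n"));
    }

    public static void main(String[] args) {
        // Body as produced by the buggy tokenizer: the regex character
        // sequence sits between the lines where the real separators were.
        String body = "line1\r|\r\n|\nline2\r|\r\n|\nline3";
        System.out.println(clean(body)); // prints line1, line2, line3 on separate lines
    }
}
```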




Re: correct way to provide regex in TokenizerExpression?

Posted by furchess123 <co...@hotmail.com>.
Forgot to mention: when I inspect the exchange that is expected to contain
the grouped N lines from the file being processed, I see that Camel inserts
the actual regex string I provided to the tokenizer between the lines from
the file! The exchange message looks like this:

[Message:
lineFromFile1*\r|\r\n|\n*lineFromFile2*\r|\r\n|\n*lineFromFile3*\r|\r\n|\n*...etc]

The sequence replaces the original line separators. This seems to be a bug;
it shouldn't be happening.




Re: correct way to provide regex in TokenizerExpression?

Posted by furchess123 <co...@hotmail.com>.
I have played with it some more, and it seems clear that the tokenizer does
NOT support regular expressions as advertised. Moreover, it seems that there
is no way to write a system-agnostic file splitter that groups lines! I may
be wrong, and if so, can anyone PLEASE show me the proper way to do it?
Perhaps add some real examples to the documentation?

Most examples seem to advocate system-specific tokenizing, e.g. using just
"\n" as a plain String to match a token separator in the payload. But what
if the file is being processed on Unix but was created on Windows? What if
the file has only "\r" separators? (My application has to deal with all
three types: \n, \r\n, \r). 

I have seen examples online that suggest that the regex string to be used
might be "\n|\r\n" or "\n|\r\n|\r". Tried it: it didn't work. Camel creates
bogus lines that contain the '|' characters and inserts those lines into the
exchange. Thank you very much. 

There just has to be an easy way to specify a list of possible token
delimiters, and the best way to do that might be via regular expressions.
The API documentation indicates that this is indeed supported. However, as I
have described, it didn't work for me at all. Any regex-language-specific
characters, such as '|', '[', ']', etc., end up being inserted into the
exchange as part of a junk line "read" from the file.
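For what it's worth, the regex itself behaves fine in plain Java, which
supports the point that the problem lies in the tokenizer rather than the
expression. A standalone sketch (class name and inputs made up) using
String.split, plus an alternation-order caveat worth knowing:

```java
public class RegexSplitDemo {

    // Split on Windows (\r\n), Unix (\n), or old-Mac (\r) terminators.
    // "\r\n" is listed first so it is matched as one delimiter, not two.
    static String[] splitMixed(String s) {
        return s.split("\r\n|\n|\r");
    }

    public static void main(String[] args) {
        // All three separator styles in one string
        String[] parts = splitMixed("w1\r\nw2\nw3\rw4");
        System.out.println(java.util.Arrays.toString(parts)); // [w1, w2, w3, w4]

        // Caveat: alternation order matters. With "\r" listed before "\r\n",
        // a Windows "\r\n" is consumed as two separate delimiters, leaving an
        // empty token in between.
        String[] wrong = "a\r\nb".split("\r|\r\n|\n");
        System.out.println(java.util.Arrays.toString(wrong)); // [a, , b]
    }
}
```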


