Posted to users@camel.apache.org by "jeevan.koteshwara" <je...@gmail.com> on 2011/08/08 19:29:58 UTC

Split large file into small files

I am trying to split a large fixed-length record file (say 350K records)
into multiple files (100K records each). I thought of using
from(src).split().method(MySplitBean.class).streaming().to(destination). But
this may cause memory problems while processing large files (say 500K
records), since MySplitBean would have to return a List object (which may
contain a huge amount of data), so I doubt this is a good approach.

Are there any other methods available to split the input file?


Re: Split large file into small files

Posted by Claus Ibsen <cl...@gmail.com>.
On Fri, Aug 12, 2011 at 6:26 PM, jeevan.koteshwara
<je...@gmail.com> wrote:
> Hi,
>   After splitting the messages, I am transforming them into a different
> format and sending them to a single file.
>
> Right now, after transforming, the data is not appending to the destination
> file (instead it is overwriting it). You suggested appending them. As I am
> using a custom iterator and my bean returns this custom iterator object, how
> can I append them to the destination? I am not very certain how to append
> the messages.

It's the option on the file endpoint:

.to("file:xxx?fileExist=Append");


>
> Please suggest on this.
>
>



-- 
Claus Ibsen
-----------------
FuseSource
Email: cibsen@fusesource.com
Web: http://fusesource.com
Twitter: davsclaus, fusenews
Blog: http://davsclaus.blogspot.com/
Author of Camel in Action: http://www.manning.com/ibsen/

Re: Split large file into small files

Posted by "jeevan.koteshwara" <je...@gmail.com>.
Hi,
   After splitting the messages, I am transforming them into a different
format and sending them to a single file.

Right now, after transforming, the data is not appending to the destination
file (instead it is overwriting it). You suggested appending them. As I am
using a custom iterator and my bean returns this custom iterator object, how
can I append them to the destination? I am not very certain how to append
the messages.

Please suggest on this.


Re: Split large file into small files

Posted by Claus Ibsen <cl...@gmail.com>.
On Fri, Aug 12, 2011 at 6:48 PM, jeevan.koteshwara
<je...@gmail.com> wrote:
> Sorry, I didn't see your reply...
>
> Thanks. It's appending the messages to the file now. Is there any way to
> append the messages first and then write them to the destination (I mean
> without using fileExist=Append, somewhere after the completion of the split
> process)?
>

It's best to append to a file, as you won't have to keep all the data in memory.

However, you can provide a custom AggregationStrategy to the Splitter
EIP, in which you can append the messages however you like, and then
afterwards send the result to the file.

See the examples on the wiki page:
http://camel.apache.org/splitter

If you have the Camel book, chapter 8 covers this in more detail.
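
A minimal sketch of such a strategy (assuming Camel 2.x; the class name is
made up, and note the trade-off: the aggregated body is kept in memory):

import org.apache.camel.Exchange;
import org.apache.camel.processor.aggregate.AggregationStrategy;

// Concatenates each split fragment onto the previous result.
public class AppendBodyStrategy implements AggregationStrategy {
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        if (oldExchange == null) {
            return newExchange; // first fragment, nothing to append to
        }
        String appended = oldExchange.getIn().getBody(String.class)
                        + newExchange.getIn().getBody(String.class);
        oldExchange.getIn().setBody(appended);
        return oldExchange;
    }
}

It is plugged in on the splitter, e.g.
split(body().tokenize("\n"), new AppendBodyStrategy()).streaming(), and the
aggregated exchange then continues in the route after the split.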



> Thanks,
> Jeevan.
>
>



-- 
Claus Ibsen
-----------------
FuseSource
Email: cibsen@fusesource.com
Web: http://fusesource.com
Twitter: davsclaus, fusenews
Blog: http://davsclaus.blogspot.com/
Author of Camel in Action: http://www.manning.com/ibsen/

Re: Split large file into small files

Posted by "jeevan.koteshwara" <je...@gmail.com>.
Sorry, I didn't see your reply...

Thanks. It's appending the messages to the file now. Is there any way to
append the messages first and then write them to the destination (I mean
without using fileExist=Append, somewhere after the completion of the split
process)?

Thanks,
Jeevan.


Re: Split large file into small files

Posted by "jeevan.koteshwara" <je...@gmail.com>.
Hi Claus,
             are you suggesting something like the below?

from(src).split().method(SplitBean.class /* returns a custom Iterator */)
    .streaming().to("file:...?fileExist=Append");

Is that the correct approach?


Re: Split large file into small files

Posted by Claus Ibsen <cl...@gmail.com>.
On Fri, Aug 12, 2011 at 5:37 PM, jeevan.koteshwara
<je...@gmail.com> wrote:
> Hi Claus,
>             one more question here.
>
> I am splitting my messages using a custom iterator. But I am seeing that
> once the route is finished (i.e. when a file is routed to the destination),
> the messages are getting overwritten in it.
>
> Say my starting message is "A,B,C". I split it into chunks "A", "B" and
> "C". But when the route is finished, I see only "C" in the file.
>
> My route is something like the below.
>
> from(src).split().method(SplitBean.class /* returns a custom Iterator */)
>     .streaming().to(dest);
>
> When I debug my code, I can see that the default aggregation (inside
> Camel) is not able to get the old exchange data (I am not sure about this).
>
> Could you please tell me where it might have gone wrong?
>

Are you writing to the same file name? Then take a look at the
fileExist option on the file component:
http://camel.apache.org/file2

By default it will override the existing file. Maybe you want Append instead?

>



-- 
Claus Ibsen
-----------------
FuseSource
Email: cibsen@fusesource.com
Web: http://fusesource.com
Twitter: davsclaus, fusenews
Blog: http://davsclaus.blogspot.com/
Author of Camel in Action: http://www.manning.com/ibsen/

Re: Split large file into small files

Posted by "jeevan.koteshwara" <je...@gmail.com>.
Hi Claus,
             one more question here.

I am splitting my messages using a custom iterator. But I am seeing that
once the route is finished (i.e. when a file is routed to the destination),
the messages are getting overwritten in it.

Say my starting message is "A,B,C". I split it into chunks "A", "B" and "C".
But when the route is finished, I see only "C" in the file.

My route is something like the below.

from(src).split().method(SplitBean.class /* returns a custom Iterator */)
    .streaming().to(dest);

When I debug my code, I can see that the default aggregation (inside Camel)
is not able to get the old exchange data (I am not sure about this).

Could you please tell me where it might have gone wrong?


Re: Split large file into small files

Posted by "jeevan.koteshwara" <je...@gmail.com>.
Thanks Claus.

Now I have a picture of how to handle the split according to my requirement.

As you suggested, I should use a custom Iterator, something like the one
below.

http://www.ecreate.co.uk/pages/inputStreamIterator.php



Re: Split large file into small files

Posted by Claus Ibsen <cl...@gmail.com>.
On Mon, Aug 8, 2011 at 7:29 PM, jeevan.koteshwara
<je...@gmail.com> wrote:
> I am trying to split a large fixed-length record file (say 350K records)
> into multiple files (100K records each). I thought of using
> from(src).split().method(MySplitBean.class).streaming().to(destination). But
> this may cause memory problems while processing large files (say 500K
> records), since MySplitBean would have to return a List object (which may
> contain a huge amount of data), so I doubt this is a good approach.
>
> Are there any other methods available to split the input file?
>

You could in fact just use a regular Java bean to do all the file
splitting manually.

Alternatively, if you want to use the Camel splitter, you can return an
iterator that iterates over a custom InputStream, by which you read the
source file in chunks, e.g. until you have read 50K lines (or reached the
end of the source file).

Then it would all be streaming based and you would not read the entire
file into memory.

But you would then have to fiddle a bit with low-level code, with a
custom iterator and a custom InputStream.
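
To make that concrete, here is a minimal sketch of such a lazy iterator (the
class name and error handling are illustrative, and it uses a BufferedReader
rather than a raw InputStream for brevity). Each call to next() reads the
following chunk, so only one chunk is ever held in memory:

import java.io.BufferedReader;
import java.io.IOException;
import java.util.Iterator;

public class ChunkIterator implements Iterator<String> {
    private final BufferedReader reader;
    private final int linesPerChunk;
    private String nextChunk; // read one chunk ahead to answer hasNext()

    public ChunkIterator(BufferedReader reader, int linesPerChunk) {
        this.reader = reader;
        this.linesPerChunk = linesPerChunk;
        this.nextChunk = readChunk();
    }

    public boolean hasNext() {
        return nextChunk != null;
    }

    public String next() {
        String chunk = nextChunk;
        nextChunk = readChunk();
        return chunk;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    // Reads up to linesPerChunk lines; returns null at end of file.
    private String readChunk() {
        StringBuilder sb = new StringBuilder();
        int count = 0;
        try {
            String line;
            while (count < linesPerChunk && (line = reader.readLine()) != null) {
                sb.append(line).append('\n');
                count++;
            }
            if (count == 0) {
                reader.close();
                return null;
            }
            return sb.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

A splitter bean can then simply return new ChunkIterator(reader, 50000), and
with .streaming() enabled Camel routes one chunk at a time.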



>



-- 
Claus Ibsen
-----------------
FuseSource
Email: cibsen@fusesource.com
Web: http://fusesource.com
Twitter: davsclaus, fusenews
Blog: http://davsclaus.blogspot.com/
Author of Camel in Action: http://www.manning.com/ibsen/

Re: Split large file into small files

Posted by "jeevan.koteshwara" <je...@gmail.com>.
Thanks a lot Claus.

Within the bounds of my requirement, the first approach (which you mentioned
above) looks very implementable.

Thanks and Regards,
Jeevan Mithyantha.


Re: Split large file into small files

Posted by Claus Ibsen <cl...@gmail.com>.
On Sun, Aug 14, 2011 at 12:57 AM, jeevan.koteshwara
<je...@gmail.com> wrote:
> Hi Christian,
>                to give you a better picture, my requirement goes like
> this.
>
> I need to transfer a fixed-length record file to a destination. Meanwhile,
> my route is responsible for transforming it into the required format (say
> to CSV or to XML).
>
> Now, the input file may be too big. It may contain many records (say about
> 500K). So if I use split(body().tokenize("\n"), new
> CustomAggregationStrategy()).streaming(), it may cause delays and may also
> lead to an out-of-memory exception while aggregating the messages.
>
> So I thought of using split().method(CustomBean.class).streaming(), where
> my CustomBean returns an Iterator (a custom iterator which iterates
> through the input message stream and splits the incoming message based on
> line numbers). In this case everything looks fine, but the end file is
> overwritten with the latest split message instead of every message being
> appended.
>
> Claus suggested using the "fileExist=Append" option. But as per my
> requirement, after this split-and-transform process I need to perform some
> more actions on the route. E.g.
>
> RouteDefinition routeDef =
>     from(src).split().method(CustomBean.class).streaming();
> routeDef = routeDef.bean(new ActionBean1()); // could be a zipping action etc.
> routeDef = routeDef.bean(new ActionBean2());
> routeDef.to(dest);
>
> In this case, if I split the messages and do not aggregate them, I am
> afraid my action beans may not perform correctly (I am not certain about
> this).
>

The aggregator on the splitter is only invoked when the sub-message is complete.
So if you invoke your action beans as part of the sub-message
routing, then the work has already been done by that point.

So would this not work for you?

from X
  split XXXX
     action bean 1
     action bean 2
     to file (append)
   end // end splitter
// after split, but no more work to do

The tricky part is that the action beans are invoked with a stream type, and
if they need to alter the message, they need to return a stream type
as well. So that can be tricky.
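
For example, a hypothetical action bean that stays stream based by accepting
and returning an InputStream (the prepended header is only for illustration):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;

public class ActionBean1 {
    // Camel binds the message body to the InputStream parameter; returning
    // an InputStream again keeps the route streaming, so the chunk is never
    // materialized as one big String in memory.
    public InputStream process(InputStream body) {
        InputStream header = new ByteArrayInputStream("HEADER\n".getBytes());
        return new SequenceInputStream(header, body);
    }
}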

You could consider dividing this into 2 steps.

from X
  split XXXX
     to file2 (write using unique file name)

from file2
  split XXXX
     action bean 1
     action bean 2
     to file (append)

And in this 2nd route, since the files are smaller and the data fits in
memory, you can avoid using streaming mode and work on the entire message
body in memory from within your action beans, if that is easier.
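
A minimal Java DSL sketch of those two steps (endpoint URIs and bean names
are illustrative; the split bean is assumed to set the Exchange.FILE_NAME
header on each chunk, as in the sample bean shown later in this thread, so
every chunk lands in its own work file):

    // Step 1: split the big file into smaller work files, fully streaming.
    from("file:inbox")
        .split().method(CustomBean.class).streaming()
            .to("file:work");

    // Step 2: each work file is small enough to process in memory.
    from("file:work")
        .bean(ActionBean1.class) // e.g. transform
        .bean(ActionBean2.class) // e.g. zip
        .to("file:outbox?fileName=result.txt&fileExist=Append");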



> So I am checking whether there is any way to aggregate the split messages
> (without using split(body().tokenize("\n"), new MyAggregationStrategy()),
> because that will cause an out-of-memory error).
>
>
>



-- 
Claus Ibsen
-----------------
FuseSource
Email: cibsen@fusesource.com
Web: http://fusesource.com
Twitter: davsclaus, fusenews
Blog: http://davsclaus.blogspot.com/
Author of Camel in Action: http://www.manning.com/ibsen/

Re: Split large file into small files

Posted by "jeevan.koteshwara" <je...@gmail.com>.
Hi Christian,
                to give you a better picture, my requirement goes like
this.

I need to transfer a fixed-length record file to a destination. Meanwhile,
my route is responsible for transforming it into the required format (say to
CSV or to XML).

Now, the input file may be too big. It may contain many records (say about
500K). So if I use split(body().tokenize("\n"), new
CustomAggregationStrategy()).streaming(), it may cause delays and may also
lead to an out-of-memory exception while aggregating the messages.

So I thought of using split().method(CustomBean.class).streaming(), where
my CustomBean returns an Iterator (a custom iterator which iterates through
the input message stream and splits the incoming message based on line
numbers). In this case everything looks fine, but the end file is
overwritten with the latest split message instead of every message being
appended.

Claus suggested using the "fileExist=Append" option. But as per my
requirement, after this split-and-transform process I need to perform some
more actions on the route. E.g.

RouteDefinition routeDef =
    from(src).split().method(CustomBean.class).streaming();
routeDef = routeDef.bean(new ActionBean1()); // could be a zipping action etc.
routeDef = routeDef.bean(new ActionBean2());
routeDef.to(dest);

In this case, if I split the messages and do not aggregate them, I am afraid
my action beans may not perform correctly (I am not certain about this).

So I am checking whether there is any way to aggregate the split messages
(without using split(body().tokenize("\n"), new MyAggregationStrategy()),
because that will cause an out-of-memory error).



Re: Split large file into small files

Posted by Christian Müller <ch...@gmail.com>.
Hello Jeevan!

Sorry for answering so late...
If your input file is line-oriented (CSV or fixed length), you do not have
to implement your own splitter.

from("file://foo.txt").split(body().tokenize("\n")).streaming().to("...");

or

from("file://foo.csv").split(body().tokenize(",")).streaming().to("...");

should meet your requirements (unless I am missing something).

Best,
Christian

Re: Split large file into small files

Posted by "jeevan.koteshwara" <je...@gmail.com>.
Christian,
             thanks for the response. I have a few doubts based on my
requirement.

I am trying to develop a custom splitter, a kind of bean, which would handle
splitting based on the number of lines in the input file (the requirement is
to split a single input file into multiple output files based on the number
of lines in the input file, say 50K lines). You suggested returning an
Iterator from my custom splitter bean. But at some point, I think we are
going to load the whole contents of the input file into memory.

My bean (just sample code) looks like the one below.

import java.io.BufferedReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.camel.Exchange;
import org.apache.camel.Message;
import org.apache.camel.impl.DefaultMessage;

public class MySplitBean {

    public Iterator<Message> splitMessage(Exchange exchange) {
        BufferedReader inputReader = exchange.getIn().getBody(BufferedReader.class);

        List<Message> messages = new ArrayList<Message>();
        String line = null;
        int count = 0;
        int fileNameCount = 0;
        StringBuffer sb = new StringBuffer();
        try {
            while (null != (line = inputReader.readLine())) {
                sb.append(line).append('\n'); // keep the line break
                count++;

                if (count == 5) { // 5 lines per chunk in this sample
                    messages.add(createNewOutput(sb, "Sample" + fileNameCount + ".txt"));
                    count = 0;
                    sb = new StringBuffer();
                    fileNameCount++;
                }
            }

            // emit the trailing chunk, which may hold fewer lines
            if (count > 0) {
                messages.add(createNewOutput(sb, "Sample" + fileNameCount + ".txt"));
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            try {
                if (inputReader != null) {
                    inputReader.close();
                }
            } catch (Exception ignore) {
                // nothing sensible to do on close failure
            }
        }

        return messages.iterator();
    }

    private Message createNewOutput(StringBuffer sb, String fileName) {
        Message message = new DefaultMessage();
        message.setBody(sb.toString());
        message.setHeader(Exchange.FILE_NAME, fileName);
        return message;
    }
}

So, while adding the contents into the list object, we are going to load the
complete file into memory. Is there any way to avoid this?

Please correct me if my understanding is wrong here.



Re: Split large file into small files

Posted by Christian Müller <ch...@gmail.com>.
Yes, it is. :-)
You should return an Iterator. If your file is XML, you may be interested in
this improvement: https://issues.apache.org/jira/browse/CAMEL-3998

If your file is a normal text file which you can split on the line feed,
you can write:
    from("...")
        .split(body().tokenize("\n")).streaming()
            .to("...");

Best,
Christian
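
Worth noting: tokenize("\n") as written emits one message per line. Later
Camel releases (around 2.9, i.e. after this thread) added a group parameter
to tokenize that bundles N lines into each split message, which would cover
the lines-per-file requirement without a custom bean. A sketch, assuming
such a version:

    from("file:inbox")
        .split().tokenize("\n", 50000).streaming()
            .to("file:outbox?fileName=result.txt&fileExist=Append");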