Posted to user@flink.apache.org by Eranga Heshan <er...@gmail.com> on 2017/08/14 03:09:15 UTC

Distribute crawling of a URL list using Flink

Hi all,

I am fairly new to Flink. In my project, I have a list of URLs (on one
node) that need to be crawled in a distributed manner. Then, for each URL,
I need the serialized crawl result to be written to a single text file.

I would like to know whether there are similar projects I can look into,
or to get an idea of how to implement this.

Thanks & Regards,



Eranga Heshan
*Undergraduate*
Computer Science & Engineering
University of Moratuwa
Mobile: +94 71 138 2686
Email: eranga.h.n@gmail.com
<https://www.facebook.com/erangaheshan> <https://twitter.com/erangaheshan>
<https://www.linkedin.com/in/erangaheshan>

Re: Distribute crawling of a URL list using Flink

Posted by Eranga Heshan <er...@gmail.com>.
Thank you, Aljoscha :-) I actually need it for a Kafka stream, so I will
use the DataStream API anyway.
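
A rough sketch of the wiring this implies (assuming Flink 1.3 with the
flink-connector-kafka-0.10 dependency; the topic name, group id, timeout,
and the AsyncUrlFetcher function are illustrative, not taken from a real
project):

import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaCrawlerJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "crawler");

        // one URL per Kafka record; topic name is illustrative
        DataStream<String> urls = env.addSource(
                new FlinkKafkaConsumer010<>("urls", new SimpleStringSchema(), props));

        // AsyncUrlFetcher is a hypothetical AsyncFunction<String, String>
        // that downloads each page (see the async I/O sketch further down
        // the thread)
        DataStream<String> pages = AsyncDataStream.unorderedWait(
                urls, new AsyncUrlFetcher(), 10000, TimeUnit.MILLISECONDS, 100);

        // parallelism 1 on the sink so all results land in one text file
        pages.writeAsText("/tmp/crawl-results.txt").setParallelism(1);

        env.execute("distributed URL crawler");
    }
}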

Regards,



Eranga Heshan
*Undergraduate*
Computer Science & Engineering
University of Moratuwa
Mobile: +94 71 138 2686
Email: eranga.h.n@gmail.com
<https://www.facebook.com/erangaheshan> <https://twitter.com/erangaheshan>
<https://www.linkedin.com/in/erangaheshan>

On Fri, Aug 25, 2017 at 5:53 PM, Aljoscha Krettek <al...@apache.org>
wrote:

> Hi,
>
> It is not available for the Batch API; you would have to use the
> DataStream API.
>
> Best,
> Aljoscha
>
> On 15. Aug 2017, at 01:16, Kien Truong <du...@gmail.com> wrote:
>
> Hi,
>
> Admittedly, I had not suggested this because I thought it was not
> available for the batch API.
>
> Regards,
> Kien
> On Aug 15, 2017, at 00:06, Nico Kruber <ni...@data-artisans.com> wrote:
>>
>> Hi Eranga and Kien,
>> Flink supports asynchronous IO since version 1.2, see [1] for details.
>>
>> You basically pack your URL download into the asynchronous part and collect
>> the resulting string for further processing in your pipeline.
>>
>>
>>
>> Nico
>>
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html
>>
>> On Monday, 14 August 2017 17:50:47 CEST Kien Truong wrote:
>>> Hi,
>>>
>>> While this task is quite trivial to do with the Flink DataSet API, using
>>> readTextFile to read the input and a flatMap function to perform the
>>> downloading, it might not be a good idea.
>>>
>>> The download process is I/O bound and will block the synchronous
>>> flatMap function, so the throughput will not be very good.
>>>
>>> Until Flink supports asynchronous functions, I suggest you look elsewhere.
>>>
>>> An example of a master-worker architecture using Akka can be found here:
>>>
>>> https://github.com/typesafehub/activator-akka-distributed-workers
>>>
>>> Regards,
>>>
>>> Kien
>>>
>>> On 8/14/2017 10:09 AM, Eranga Heshan wrote:
>>>> Hi all,
>>>>
>>>> I am fairly new to Flink. In my project, I have a list of URLs (on one
>>>> node) that need to be crawled in a distributed manner. Then, for each
>>>> URL, I need the serialized crawl result to be written to a single text
>>>> file.
>>>>
>>>> I would like to know whether there are similar projects I can look into,
>>>> or to get an idea of how to implement this.
>>>>
>>>> Thanks & Regards,
>>>>
>>>> Eranga Heshan
>>>> /Undergraduate/
>>>> Computer Science & Engineering
>>>> University of Moratuwa
>>>> Mobile: +94 71 138 2686
>>>> Email: eranga.h.n@gmail.com
>>>> <https://www.facebook.com/erangaheshan> <https://twitter.com/erangaheshan>
>>>> <https://www.linkedin.com/in/erangaheshan>
>

Re: Distribute crawling of a URL list using Flink

Posted by Aljoscha Krettek <al...@apache.org>.
Hi,

It is not available for the Batch API; you would have to use the DataStream API.

Best,
Aljoscha

> On 15. Aug 2017, at 01:16, Kien Truong <du...@gmail.com> wrote:
> 
> Hi, 
> 
> Admittedly, I had not suggested this because I thought it was not available for the batch API.
> 
> Regards, 
> Kien 
> On Aug 15, 2017, at 00:06, Nico Kruber <nico@data-artisans.com> wrote:
> Hi Eranga and Kien,
> Flink supports asynchronous IO since version 1.2, see [1] for details.
> 
> You basically pack your URL download into the asynchronous part and collect 
> the resulting string for further processing in your pipeline.
> 
> 
> 
> Nico
> 
> 
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html
> 
> On Monday, 14 August 2017 17:50:47 CEST Kien Truong wrote:
>> Hi,
>> 
>> While this task is quite trivial to do with the Flink DataSet API, using
>> readTextFile to read the input and a flatMap function to perform the
>> downloading, it might not be a good idea.
>> 
>> The download process is I/O bound and will block the synchronous
>> flatMap function, so the throughput will not be very good.
>> 
>> Until Flink supports asynchronous functions, I suggest you look elsewhere.
>> 
>> An example of a master-worker architecture using Akka can be found here:
>> 
>> https://github.com/typesafehub/activator-akka-distributed-workers
>> 
>> Regards,
>> 
>> Kien
>> 
>> On 8/14/2017 10:09 AM, Eranga Heshan wrote:
>>> Hi all,
>>> 
>>> I am fairly new to Flink. In my project, I have a list of URLs (on one
>>> node) that need to be crawled in a distributed manner. Then, for each
>>> URL, I need the serialized crawl result to be written to a single text
>>> file.
>>> 
>>> I would like to know whether there are similar projects I can look into,
>>> or to get an idea of how to implement this.
>>> 
>>> Thanks & Regards,
>>> 
>>> Eranga Heshan
>>> /Undergraduate/
>>> Computer Science & Engineering
>>> University of Moratuwa
>>> Mobile: +94 71 138 2686
>>> Email: eranga.h.n@gmail.com
>>> <https://www.facebook.com/erangaheshan> <https://twitter.com/erangaheshan>
>>> <https://www.linkedin.com/in/erangaheshan>


Re: Distribute crawling of a URL list using Flink

Posted by Kien Truong <du...@gmail.com>.
Hi, 

Admittedly, I had not suggested this because I thought it was not available for the batch API.

Regards, 
Kien 


On Aug 15, 2017, at 00:06, Nico Kruber <ni...@data-artisans.com> wrote:
>Hi Eranga and Kien,
>Flink supports asynchronous IO since version 1.2, see [1] for details.
>
>You basically pack your URL download into the asynchronous part and
>collect 
>the resulting string for further processing in your pipeline.
>
>
>
>Nico
>
>
>[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html
>
>On Monday, 14 August 2017 17:50:47 CEST Kien Truong wrote:
>> Hi,
>> 
>> While this task is quite trivial to do with the Flink DataSet API, using
>> readTextFile to read the input and a flatMap function to perform the
>> downloading, it might not be a good idea.
>> 
>> The download process is I/O bound and will block the synchronous
>> flatMap function, so the throughput will not be very good.
>> 
>> Until Flink supports asynchronous functions, I suggest you look elsewhere.
>> 
>> An example of a master-worker architecture using Akka can be found here:
>> 
>> https://github.com/typesafehub/activator-akka-distributed-workers
>> 
>> 
>> Regards,
>> 
>> Kien
>> 
>> On 8/14/2017 10:09 AM, Eranga Heshan wrote:
>> > Hi all,
>> > 
>> > I am fairly new to Flink. In my project, I have a list of URLs (on one
>> > node) that need to be crawled in a distributed manner. Then, for each
>> > URL, I need the serialized crawl result to be written to a single text
>> > file.
>> > 
>> > I would like to know whether there are similar projects I can look into,
>> > or to get an idea of how to implement this.
>> > 
>> > Thanks & Regards,
>> > 
>> > 
>> > 
>> > 
>> > Eranga Heshan
>> > /Undergraduate/
>> > Computer Science & Engineering
>> > University of Moratuwa
>> > Mobile: +94 71 138 2686
>> > Email: eranga.h.n@gmail.com
>> > <https://www.facebook.com/erangaheshan>
>> > <https://twitter.com/erangaheshan>
>> > <https://www.linkedin.com/in/erangaheshan>

Re: Distribute crawling of a URL list using Flink

Posted by Eranga Heshan <er...@gmail.com>.
Thanks for your quick replies, Nico and Kien. Since I am using Flink 1.3.0,
I will try Nico's idea. I might bug you again with future problems. 😊

Regards,



Eranga Heshan
*Undergraduate*
Computer Science & Engineering
University of Moratuwa
Mobile: +94 71 138 2686
Email: eranga.h.n@gmail.com
<https://www.facebook.com/erangaheshan> <https://twitter.com/erangaheshan>
<https://www.linkedin.com/in/erangaheshan>

On Mon, Aug 14, 2017 at 10:36 PM, Nico Kruber <ni...@data-artisans.com>
wrote:

> Hi Eranga and Kien,
> Flink supports asynchronous IO since version 1.2, see [1] for details.
>
> You basically pack your URL download into the asynchronous part and collect
> the resulting string for further processing in your pipeline.
>
>
>
> Nico
>
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html
>
> On Monday, 14 August 2017 17:50:47 CEST Kien Truong wrote:
> > Hi,
> >
> > While this task is quite trivial to do with the Flink DataSet API, using
> > readTextFile to read the input and a flatMap function to perform the
> > downloading, it might not be a good idea.
> >
> > The download process is I/O bound and will block the synchronous
> > flatMap function, so the throughput will not be very good.
> >
> > Until Flink supports asynchronous functions, I suggest you look elsewhere.
> >
> > An example of a master-worker architecture using Akka can be found here:
> >
> > https://github.com/typesafehub/activator-akka-distributed-workers
> >
> >
> > Regards,
> >
> > Kien
> >
> > On 8/14/2017 10:09 AM, Eranga Heshan wrote:
> > > Hi all,
> > >
> > > I am fairly new to Flink. In my project, I have a list of URLs (on one
> > > node) that need to be crawled in a distributed manner. Then, for each
> > > URL, I need the serialized crawl result to be written to a single text
> > > file.
> > >
> > > I would like to know whether there are similar projects I can look into,
> > > or to get an idea of how to implement this.
> > >
> > > Thanks & Regards,
> > >
> > >
> > >
> > >
> > > Eranga Heshan
> > > /Undergraduate/
> > > Computer Science & Engineering
> > > University of Moratuwa
> > > Mobile: +94 71 138 2686
> > > Email: eranga.h.n@gmail.com
> > > <https://www.facebook.com/erangaheshan>
> > > <https://twitter.com/erangaheshan>
> > > <https://www.linkedin.com/in/erangaheshan>
>
>

Re: Distribute crawling of a URL list using Flink

Posted by Nico Kruber <ni...@data-artisans.com>.
Hi Eranga and Kien,
Flink supports asynchronous IO since version 1.2, see [1] for details.

You basically pack your URL download into the asynchronous part and collect 
the resulting string for further processing in your pipeline.
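
For the download itself, a minimal sketch against the Flink 1.3 async I/O
API could look like the following (class name, pool size, and error handling
are illustrative; the blocking JDK call is pushed onto its own executor so
the Flink task thread stays free):

import java.io.InputStream;
import java.net.URL;
import java.util.Collections;
import java.util.Scanner;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

// fetches each URL without blocking the Flink task thread; the blocking
// HTTP call runs on a separate, dedicated thread pool
public class AsyncUrlFetcher extends RichAsyncFunction<String, String> {

    private transient ExecutorService httpPool;

    @Override
    public void open(Configuration parameters) {
        httpPool = Executors.newFixedThreadPool(20); // pool size is illustrative
    }

    @Override
    public void close() {
        httpPool.shutdown();
    }

    @Override
    public void asyncInvoke(final String url, final AsyncCollector<String> collector) {
        CompletableFuture.supplyAsync(() -> {
            // plain-JDK download; any async HTTP client would fit here too
            try (InputStream in = new URL(url).openStream();
                 Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                return s.hasNext() ? s.next() : "";
            } catch (Exception e) {
                return ""; // real code should surface the failure instead
            }
        }, httpPool).thenAccept(
                page -> collector.collect(Collections.singleton(page)));
    }
}

You would then hook it into the pipeline via, e.g.,
AsyncDataStream.unorderedWait(urls, new AsyncUrlFetcher(), 10000,
TimeUnit.MILLISECONDS, 100), as described in [1].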



Nico


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html

On Monday, 14 August 2017 17:50:47 CEST Kien Truong wrote:
> Hi,
> 
> While this task is quite trivial to do with the Flink DataSet API, using
> readTextFile to read the input and a flatMap function to perform the
> downloading, it might not be a good idea.
> 
> The download process is I/O bound and will block the synchronous
> flatMap function, so the throughput will not be very good.
> 
> Until Flink supports asynchronous functions, I suggest you look elsewhere.
> 
> An example of a master-worker architecture using Akka can be found here:
> 
> https://github.com/typesafehub/activator-akka-distributed-workers
> 
> 
> Regards,
> 
> Kien
> 
> On 8/14/2017 10:09 AM, Eranga Heshan wrote:
> > Hi all,
> > 
> > I am fairly new to Flink. In my project, I have a list of URLs (on one
> > node) that need to be crawled in a distributed manner. Then, for each
> > URL, I need the serialized crawl result to be written to a single text
> > file.
> > 
> > I would like to know whether there are similar projects I can look into,
> > or to get an idea of how to implement this.
> > 
> > Thanks & Regards,
> > 
> > 
> > 
> > 
> > Eranga Heshan
> > /Undergraduate/
> > Computer Science & Engineering
> > University of Moratuwa
> > Mobile: +94 71 138 2686
> > Email: eranga.h.n@gmail.com
> > <https://www.facebook.com/erangaheshan>
> > <https://twitter.com/erangaheshan>
> > <https://www.linkedin.com/in/erangaheshan>


Re: Distribute crawling of a URL list using Flink

Posted by Kien Truong <du...@gmail.com>.
Hi,

While this task is quite trivial to do with the Flink DataSet API, using
readTextFile to read the input and a flatMap function to perform the
downloading, it might not be a good idea.

The download process is I/O bound and will block the synchronous
flatMap function, so the throughput will not be very good.
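
Roughly, the naive version I mean would look like the sketch below (paths
and class name are illustrative); note how each subtask sits idle inside
flatMap while a download is in flight:

import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.FileSystem.WriteMode;
import org.apache.flink.util.Collector;

public class NaiveCrawler {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        env.readTextFile("file:///tmp/urls.txt")        // one URL per line
           .flatMap((String url, Collector<String> out) -> {
               // blocking fetch: the whole subtask waits on network I/O here
               try (InputStream in = new URL(url).openStream();
                    Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                   if (s.hasNext()) {
                       out.collect(s.next());
                   }
               } catch (Exception e) {
                   // skip unreachable URLs
               }
           })
           .returns(String.class)                       // lambda needs an explicit type
           .writeAsText("file:///tmp/results.txt", WriteMode.OVERWRITE)
           .setParallelism(1);                          // single output file

        env.execute("naive crawler");
    }
}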


Until Flink supports asynchronous functions, I suggest you look elsewhere.

An example of a master-worker architecture using Akka can be found here:

https://github.com/typesafehub/activator-akka-distributed-workers


Regards,

Kien



On 8/14/2017 10:09 AM, Eranga Heshan wrote:
> Hi all,
>
> I am fairly new to Flink. In my project, I have a list of URLs (on one
> node) that need to be crawled in a distributed manner. Then, for each
> URL, I need the serialized crawl result to be written to a single text
> file.
>
> I would like to know whether there are similar projects I can look into,
> or to get an idea of how to implement this.
>
> Thanks & Regards,
>
>
>
> Eranga Heshan
> /Undergraduate/
> Computer Science & Engineering
> University of Moratuwa
> Mobile: +94 71 138 2686
> Email: eranga.h.n@gmail.com
> <https://www.facebook.com/erangaheshan> 
> <https://twitter.com/erangaheshan> 
> <https://www.linkedin.com/in/erangaheshan>
>