Posted to user@flink.apache.org by Flavio Pompermaier <po...@okkam.it> on 2016/09/26 16:13:43 UTC

Best way to trigger dataset sampling

Hi to all,

I have a use case where I need to tell a Flink cluster to give me a sample
of X records using parametrizable sampling functions. Is there any best
practice or advice to do that?

Should I create a Remote ExecutionEnvironment or should I use the Flink
client (I don't know if it uses REST services or RPC or whatever)?
Is there any Java snippet for that?

Best,
Flavio

Re: Best way to trigger dataset sampling

Posted by Flavio Pompermaier <po...@okkam.it>.
I think I'll probably end up submitting the job through YARN in order to
have a more standard approach :)

Thanks,
Flavio

On Wed, Sep 28, 2016 at 5:19 PM, Maximilian Michels <mx...@apache.org> wrote:

> I meant that you simply keep the sampling jar on the machine where you
> want to sample. However, you mentioned that it is a requirement for it
> to be on the cluster.
>
> Cheers,
> Max

Re: Best way to trigger dataset sampling

Posted by Maximilian Michels <mx...@apache.org>.
I meant that you simply keep the sampling jar on the machine where you
want to sample. However, you mentioned that it is a requirement for it
to be on the cluster.

Cheers,
Max

On Tue, Sep 27, 2016 at 3:18 PM, Flavio Pompermaier
<po...@okkam.it> wrote:
> Hi max,
> that's exactly what I was looking for. What do you mean for 'the best thing
> is if you keep a local copy of your sampling jars and work directly with
> them'?
>
> Best,
> Flavio

Re: Best way to trigger dataset sampling

Posted by Flavio Pompermaier <po...@okkam.it>.
Hi Max,
that's exactly what I was looking for. What do you mean by 'the best thing
is if you keep a local copy of your sampling jars and work directly with
them'?

Best,
Flavio

On Tue, Sep 27, 2016 at 2:35 PM, Maximilian Michels <mx...@apache.org> wrote:

> Hi Flavio,
>
> This is not really possible at the moment. Though there is a workaround.
> You can create a dummy jar file (may be empty). Then you can use
>
> ./flink run -C hdfs:///path/to/cluster.jar -c org.package.SampleClass
> /path/to/dummy.jar
>
> That way Flink will include your cluster jar and you can load all classes
> necessary.
>
> Alternatively, using the Remote Environment, this looks like this:
>
> public static void main(String[] args) throws Exception {
>
>    final RemoteEnvironment env = new RemoteEnvironment(
>       "remoteHost",
>       6123,
>       new Configuration(),
>       new String[0],
>       new URL[]{
>          new URL("file:///path/to/sample.jar"),
>          new URL("file:///Users/max/Dev/flink/build-target/lib/flink-dist_2.10-1.2-SNAPSHOT.jar")});
>    URLClassLoader classLoader = new URLClassLoader(env.globalClasspaths.toArray(new URL[0]));
>
>    Class<?> clazz = classLoader.loadClass("org.package.sample.SampleClass");
>
>    Method main = clazz.getDeclaredMethod("sampleMethod", ExecutionEnvironment.class);
>
>    // pass environment as an argument to your sample method
>    // the method should return the results of the execution
>    Object sampleResult = main.invoke(null, env);
> }
>
>
> Beware, this is extremely hacky. We should have a better way to invoke jar
> files remotely. Honestly, the best thing is if you keep a local copy of
> your sampling jars and work directly with them.
>
> Cheers,
> Max

Re: Best way to trigger dataset sampling

Posted by Maximilian Michels <mx...@apache.org>.
Hi Flavio,

This is not really possible at the moment, though there is a workaround.
You can create a dummy jar file (it may be empty) and then run:

./flink run -C hdfs:///path/to/cluster.jar -c org.package.SampleClass /path/to/dummy.jar

That way Flink will include your cluster jar and you can load all the
necessary classes.
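
As a side note on the dummy jar mentioned above: any jar tool can produce it, but a minimal sketch in plain Java could look like the following. It uses only the JDK's java.util.jar classes and writes a jar that contains nothing but a manifest. The class name DummyJar and the method write are made up for this illustration, and whether a given Flink version accepts such a jar is an assumption based on the remark that the jar "may be empty".

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical helper, not part of Flink.
final class DummyJar {

    /** Writes a jar at the given path that holds only a manifest and no classes. */
    static void write(Path target) throws IOException {
        Manifest manifest = new Manifest();
        // Declare a manifest version so that the main attributes are written out.
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");

        try (OutputStream out = Files.newOutputStream(target);
             JarOutputStream jar = new JarOutputStream(out, manifest)) {
            // Intentionally empty: the constructor has already added the manifest
            // entry, and no further entries are needed for a placeholder jar.
        }
    }
}
```

Calling DummyJar.write(Paths.get("/path/to/dummy.jar")) once would produce the file that is passed as the last argument of the command above.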

Alternatively, using the RemoteEnvironment, it looks like this:

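import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.RemoteEnvironment;
import org.apache.flink.configuration.Configuration;

public class RemoteSampling { // class name is only illustrative

public static void main(String[] args) throws Exception {

   // jars that have to be on the classpath of the job
   final URL[] classpaths = new URL[]{
      new URL("file:///path/to/sample.jar"),
      new URL("file:///Users/max/Dev/flink/build-target/lib/flink-dist_2.10-1.2-SNAPSHOT.jar")};

   final RemoteEnvironment env = new RemoteEnvironment(
      "remoteHost",
      6123,
      new Configuration(),
      new String[0],
      classpaths);

   // load the sampling class from the same classpath URLs
   URLClassLoader classLoader = new URLClassLoader(classpaths);

   Class<?> clazz = classLoader.loadClass("org.package.sample.SampleClass");

   Method sampleMethod = clazz.getDeclaredMethod("sampleMethod", ExecutionEnvironment.class);

   // pass environment as an argument to your sample method
   // the method should return the results of the execution
   Object sampleResult = sampleMethod.invoke(null, env);
}
}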


Beware, this is extremely hacky. We should have a better way to invoke jar
files remotely. Honestly, the best thing is if you keep a local copy of
your sampling jars and work directly with them.
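
For readers unfamiliar with the reflective part of the snippet above (loadClass, getDeclaredMethod, invoke with a null receiver), here is a small self-contained sketch that runs without Flink. FakeEnvironment and SampleClass are stand-ins invented for this illustration: the first plays the role of the ExecutionEnvironment, the second the role of the class in the sampling jar. The only point it makes is that the target method has to be static, take the environment as its single parameter, and return the collected results.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

// Illustration only: none of these types are Flink classes.
final class ReflectiveInvokeSketch {

    /** Hypothetical stand-in for the execution environment. */
    static class FakeEnvironment {
        final String host;
        FakeEnvironment(String host) { this.host = host; }
    }

    /** Hypothetical stand-in for the class shipped in the sampling jar. */
    static class SampleClass {
        // Static, so it can be invoked without creating an instance.
        static List<String> sampleMethod(FakeEnvironment env) {
            return Arrays.asList("record-1@" + env.host, "record-2@" + env.host);
        }
    }

    /** Loads a class by name, finds a static one-argument method and calls it. */
    static Object invokeStatic(ClassLoader loader, String className, String methodName,
                               Class<?> paramType, Object arg) throws Exception {
        Class<?> clazz = loader.loadClass(className);
        Method method = clazz.getDeclaredMethod(methodName, paramType);
        method.setAccessible(true); // the method in this sketch is not public
        return method.invoke(null, arg); // null receiver because the method is static
    }
}
```

In the real setup the class loader would be the URLClassLoader built from the jar URLs, and the argument would be the remote environment instead of the stand-in.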

Cheers,
Max

On Tue, Sep 27, 2016 at 12:25 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> Hi Max,
> actually I have a jar containing sampling jobs and I need to collect
> results from a client.
> I've tried to use ExecutionEnvironment.createRemoteEnvironment but I fear
> that it's not the right way to do that because
> I just need to tell the cluster the main class and the parameters to run
> the job (and where the jar file is on HDFS).
>
> Best,
> Flavio

Re: Best way to trigger dataset sampling

Posted by Flavio Pompermaier <po...@okkam.it>.
Hi Max,
actually I have a jar containing sampling jobs and I need to collect the
results from a client.
I've tried to use ExecutionEnvironment.createRemoteEnvironment, but I fear
that it's not the right way to do it, because I just need to tell the
cluster the main class and the parameters to run the job (and where the
jar file is on HDFS).

Best,
Flavio

On Tue, Sep 27, 2016 at 12:06 PM, Maximilian Michels <mx...@apache.org> wrote:

> Hi Flavio,
>
> Do you want to sample from a running batch job? That would be like
> Queryable State in streaming jobs but it is not supported in batch
> mode.
>
> Cheers,
> Max

Re: Best way to trigger dataset sampling

Posted by Maximilian Michels <mx...@apache.org>.
Hi Flavio,

Do you want to sample from a running batch job? That would be like
Queryable State in streaming jobs, but it is not supported in batch
mode.

Cheers,
Max


On Mon, Sep 26, 2016 at 6:13 PM, Flavio Pompermaier
<po...@okkam.it> wrote:
> Hi to all,
>
> I have a use case where I need to tell a Flink cluster to give me a sample
> of X records using parametrizable sampling functions. Is there any best
> practice or advice to do that?
>
> Should I create a Remote ExecutionEnvironment or should I use the Flink
> client (I don't know if it uses REST services or RPC or whatever)?
> Is there any java snippet for that?
>
> Best,
> Flavio
>