Posted to user@pig.apache.org by Andreas Paepcke <pa...@cs.stanford.edu> on 2011/01/31 20:11:25 UTC
Splitting Strategy When Records Flow From the Net at Runtime
I pull records from a remote Web site. I have a subclass of
RecordReader, which knows how to retrieve those records one by one
from a Web stream. The Web site is set up such that I can run multiple
such readers, each pulling a distinct subset of the records from the
site.
My plan: in my subclass of InputFormat I work out a good load balance
for a given number of mapper machines. My splits will then identify
record subsets, which each mapper pulls from the Web site and
processes at runtime.
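A minimal sketch of that balancing step, under assumptions not stated in the post: the total record count is known up front, and the site can serve records by contiguous index range. SplitPlanner and Range are hypothetical names, not part of Hadoop or Pig.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Hadoop or Pig class): divides record indices
// [0, totalRecords) into contiguous ranges, at most one per mapper, whose
// sizes differ by no more than one record.
final class SplitPlanner {

    /** Half-open range [start, end) of record indices for one split. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long size() {
            return end - start;
        }
    }

    static List<Range> plan(long totalRecords, int numMappers) {
        if (totalRecords < 0 || numMappers <= 0) {
            throw new IllegalArgumentException("need totalRecords >= 0 and numMappers > 0");
        }
        List<Range> ranges = new ArrayList<>();
        long base = totalRecords / numMappers;   // records every split gets
        long extra = totalRecords % numMappers;  // first 'extra' splits get one more
        long start = 0;
        // Stop early when there are fewer records than mappers, so that no
        // empty splits are produced.
        for (int i = 0; i < numMappers && start < totalRecords; i++) {
            long size = base + (i < extra ? 1 : 0);
            ranges.add(new Range(start, start + size));
            start += size;
        }
        return ranges;
    }
}
```

An InputFormat's getSplits() could call SplitPlanner.plan(total, mappers) and wrap each Range in its own split object. If the site partitions records by something other than an index range, only the Range type would need to change.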
Two questions:
1. InputSplit wants a "list of nodes by name where the data for the
   split would be local." But the records will be pulled at runtime
   by each mapper. So no data is local to a node. Is it safe to
   return an empty String[] from my getLocations() implementation?
2. The getLength() method also seems geared towards files. In my
   case I would presumably just return the number of records each
   split will retrieve from the Web server? I.e. the cardinality of
   my subsets?
Thanks!
Andreas
Re: Splitting Strategy When Records Flow From the Net at Runtime
Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Andreas, I don't remember off the top of my head, but I think both an
empty locations list and the cardinality of the set as "length" are correct.
Double-check on the Hadoop MapReduce user list.
D
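For illustration, a split along the lines discussed above might look like the sketch below. It is not taken from the thread: WebRecordSplit and its fields are hypothetical, and InputSplitStandIn is a local stand-in for org.apache.hadoop.mapreduce.InputSplit so that the example compiles without Hadoop on the classpath. In a real job the class would extend the Hadoop InputSplit and implement org.apache.hadoop.io.Writable, whose write and readFields methods have the signatures used here.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Stand-in for org.apache.hadoop.mapreduce.InputSplit (assumption: used only
// so this sketch is self-contained; extend the real Hadoop class in a job).
abstract class InputSplitStandIn {
    abstract long getLength() throws IOException, InterruptedException;
    abstract String[] getLocations() throws IOException, InterruptedException;
}

// Hypothetical split covering records [firstRecord, firstRecord + recordCount)
// that a mapper pulls from the remote site at runtime.
final class WebRecordSplit extends InputSplitStandIn {
    private long firstRecord;
    private long recordCount;

    // No-arg constructor: needed when a split is re-created before
    // readFields() fills in its state.
    WebRecordSplit() {
    }

    WebRecordSplit(long firstRecord, long recordCount) {
        this.firstRecord = firstRecord;
        this.recordCount = recordCount;
    }

    // "Length" is the number of records in this split, not a byte count.
    @Override
    long getLength() {
        return recordCount;
    }

    // No node holds the data locally, so report no preferred hosts.
    @Override
    String[] getLocations() {
        return new String[0];
    }

    long getFirstRecord() {
        return firstRecord;
    }

    // Same signatures as org.apache.hadoop.io.Writable.
    void write(DataOutput out) throws IOException {
        out.writeLong(firstRecord);
        out.writeLong(recordCount);
    }

    void readFields(DataInput in) throws IOException {
        firstRecord = in.readLong();
        recordCount = in.readLong();
    }
}
```

Returning an empty array rather than null keeps callers that iterate over the locations from hitting a NullPointerException. Whether the scheduler treats the empty list exactly as intended is still worth confirming on the Hadoop list, as suggested above.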