Posted to user@pig.apache.org by Dmitriy Ryaboy <dv...@gmail.com> on 2011/02/01 09:50:19 UTC

Re: Splitting Strategy When Records Flow From the Net at Runtime

Andreas, I don't remember off the top of my head, but I think both are
correct: an empty locations list, and the cardinality of the set as the
"length". Double-check on the Hadoop MapReduce user list.
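
For what it's worth, here is a minimal sketch of such a split, untested
against a real cluster. The class name WebRecordSplit, the firstRecord /
recordCount fields, and the planSplits helper are all made up for
illustration. To keep it dependency-free it does not extend anything; in
a real job it would extend org.apache.hadoop.mapreduce.InputSplit and
implement org.apache.hadoop.io.Writable, whose method names it mirrors.
If I recall correctly, Hadoop's own DBInputFormat split does much the
same thing (empty locations, row count as length).

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical split describing a contiguous range of remote records.
// Assumption: the Web site can address records by index range.
// In a real job: extends InputSplit implements Writable.
class WebRecordSplit {
    private long firstRecord;  // index of the first record to pull
    private long recordCount;  // number of records in this split

    // No-arg constructor, needed so the split can be deserialized.
    WebRecordSplit() {}

    WebRecordSplit(long firstRecord, long recordCount) {
        this.firstRecord = firstRecord;
        this.recordCount = recordCount;
    }

    long getFirstRecord() { return firstRecord; }

    // There are no bytes on disk, so report the cardinality of the subset.
    long getLength() { return recordCount; }

    // No data is local to any node: return an empty array, not null.
    String[] getLocations() { return new String[0]; }

    // Writable-style serialization so the split can be shipped to a mapper.
    void write(DataOutput out) throws IOException {
        out.writeLong(firstRecord);
        out.writeLong(recordCount);
    }

    void readFields(DataInput in) throws IOException {
        firstRecord = in.readLong();
        recordCount = in.readLong();
    }

    // Spread totalRecords over numSplits as evenly as possible; the first
    // (totalRecords % numSplits) splits each get one extra record.
    static List<WebRecordSplit> planSplits(long totalRecords, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        List<WebRecordSplit> splits = new ArrayList<>();
        long base = totalRecords / numSplits;
        long extra = totalRecords % numSplits;
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            long count = base + (i < extra ? 1 : 0);
            splits.add(new WebRecordSplit(start, count));
            start += count;
        }
        return splits;
    }
}
```

Your InputFormat's getSplits() would then return something like
planSplits(totalRecords, numMappers), and each RecordReader would pull
only the range named by its split.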

D

On Mon, Jan 31, 2011 at 11:11 AM, Andreas Paepcke
<pa...@cs.stanford.edu> wrote:
> I pull records from a remote Web site. I have a subclass of
> RecordReader, which knows how to retrieve those records one by one
> from a Web stream. The Web site is set up such that I can run multiple
> such readers, each pulling a distinct subset of the records from the
> site.
>
> My plan: In my subclass of InputFormat I figure out a good
> load balance, given a number of mapper machines. My splits will then
> identify record subsets, which each mapper is to pull from the Web
> site and process at runtime.
>
> Two questions:
>   1. InputSplit wants a "list of nodes by name where the data for the
>      split would be local." But the records will be pulled at runtime
>      by each mapper. So no data is local to a node. Is it safe to
>      return an empty String[] from my getLocations() implementation?
>   2. The getLength() method also seems geared towards files. In my
>      case I would presumably just return the number of records each
>      split will retrieve from the Web server? I.e. the cardinality of
>      my subsets?
>
> Thanks!
>
> Andreas
>