Posted to user@pig.apache.org by Andreas Paepcke <pa...@cs.stanford.edu> on 2011/01/31 20:11:25 UTC

Splitting Strategy When Records Flow From the Net at Runtime

I pull records from a remote Web site. I have a subclass of
RecordReader, which knows how to retrieve those records one by one
from a Web stream. The Web site is set up such that I can run multiple
such readers, each pulling a distinct subset of the records from the
site.

My plan: In my subclass of InputFormat I figure out a good
load balance, given a number of mapper machines. My splits will then
identify record subsets, which each mapper is to pull from the Web
site and process at runtime.
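
A minimal sketch of the load-balancing step, assuming the records can be
addressed by a contiguous index range (the post only says the subsets are
distinct, so this is an assumption). RecordRangePlanner, Range, and plan()
are hypothetical names for illustration, not part of Hadoop or Pig. It
divides the records so that range sizes differ by at most one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: divides record indices [0, totalRecords) into near-equal ranges. */
final class RecordRangePlanner {

    /** Half-open range [start, end) of record indices assigned to one mapper. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long size() {
            return end - start;
        }
    }

    /** Returns at most numMappers non-empty ranges that together cover all records. */
    static List<Range> plan(long totalRecords, int numMappers) {
        if (totalRecords < 0 || numMappers <= 0) {
            throw new IllegalArgumentException("need totalRecords >= 0 and numMappers > 0");
        }
        List<Range> ranges = new ArrayList<>();
        long base = totalRecords / numMappers;   // every range gets at least this many
        long extra = totalRecords % numMappers;  // the first 'extra' ranges get one more
        long start = 0;
        for (int i = 0; i < numMappers; i++) {
            long size = base + (i < extra ? 1 : 0);
            if (size == 0) {
                break;  // fewer records than mappers: stop once the records run out
            }
            ranges.add(new Range(start, start + size));
            start += size;
        }
        return ranges;
    }
}
```

Each Range would then back one split, so the number of splits (and hence
mappers) follows from the list returned by plan().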

Two questions:
   1. InputSplit wants a "list of nodes by name where the data for the
      split would be local." But the records will be pulled at runtime
      by each mapper. So no data is local to a node. Is it safe to
      return an empty String[] from my getLocations() implementation?
   2. The getLength() method also seems geared towards files. In my
      case I would presumably just return the number of records each
      split will retrieve from the Web server? I.e. the cardinality of
      my subsets?

Thanks!

Andreas

Re: Splitting Strategy When Records Flow From the Net at Runtime

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Andreas, I don't remember off the top of my head, but I think both an
empty locations list and the cardinality of the set as "length" are correct.
Double-check on the Hadoop MapReduce user list.
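
For illustration, a rough sketch of what such a split could look like.
WebRecordSplit and its fields are hypothetical names. To keep the sketch
compilable without Hadoop on the classpath it does not extend anything; in
a real job the class would extend org.apache.hadoop.mapreduce.InputSplit
and would also need to be serializable by the framework (typically by
implementing Writable with a no-argument constructor).

```java
/**
 * Sketch of a split describing records [firstRecord, firstRecord + recordCount)
 * that a mapper fetches from the Web site at runtime.
 * Assumption: records are addressable by a contiguous index range.
 */
final class WebRecordSplit {
    private final long firstRecord;
    private final long recordCount;

    WebRecordSplit(long firstRecord, long recordCount) {
        if (firstRecord < 0 || recordCount < 0) {
            throw new IllegalArgumentException("firstRecord and recordCount must be >= 0");
        }
        this.firstRecord = firstRecord;
        this.recordCount = recordCount;
    }

    long getFirstRecord() {
        return firstRecord;
    }

    /** Size of the split; here the number of records rather than a byte count. */
    long getLength() {
        return recordCount;
    }

    /** No node holds the data locally, so no locality hints are offered. */
    String[] getLocations() {
        return new String[0];
    }
}
```

With an empty locations array the scheduler simply has no locality
preference for the split, and getLength() only needs to be a consistent
measure of relative split size.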

D
