Posted to common-user@hadoop.apache.org by "Black, Michael (IS)" <Mi...@ngc.com> on 2010/12/24 17:34:57 UTC

Custom input split

Using hadoop-0.20


I'm doing custom input splits from a Lucene index.

I want to split the document IDs across N mappers (I'm testing the
scalability of the problem across 4 nodes and 8 cores).

So the key is the document number, and the numbers are not sequential.

At this point I'm using splits.add to add each document...but that sets up
one task for every document...not something I want to do, of course.

How can I add a group of documents to each split?  I found a scant reference
to PrimeInputSplit but that doesn't seem to resolve on hadoop-0.20.
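
What I'm picturing is a split that carries a whole batch of IDs rather than one. Here's an untested sketch of the sort of thing I mean (DocBatchSplit is just a name I made up, using the new mapreduce API):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class DocBatchSplit extends InputSplit implements Writable {
        private List<String> docIds = new ArrayList<String>();

        public DocBatchSplit() {} // no-arg constructor needed for deserialization

        public DocBatchSplit(List<String> docIds) {
                this.docIds = docIds;
        }

        public List<String> getDocIds() {
                return docIds;
        }

        @Override
        public long getLength() {
                return docIds.size(); // rough size hint for the scheduler
        }

        @Override
        public String[] getLocations() {
                return new String[0]; // no locality information for index documents
        }

        public void write(DataOutput out) throws IOException {
                out.writeInt(docIds.size());
                for (String id : docIds)
                        Text.writeString(out, id);
        }

        public void readFields(DataInput in) throws IOException {
                int n = in.readInt();
                docIds = new ArrayList<String>(n);
                for (int i = 0; i < n; i++)
                        docIds.add(Text.readString(in));
        }
}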


Michael D. Black
Senior Scientist
Northrop Grumman Information Systems
Advanced Analytics Directorate




Re: Custom input split

Posted by Lance Norskog <go...@gmail.com>.
Please don't use attachments. They should be stripped by the Apache
mailer. There are a bunch of mail archiver sites which don't save
attachments.

Lance

On Sun, Dec 26, 2010 at 8:20 AM, Harsh J <qw...@gmail.com> wrote:
> Hi,
>
> On Sun, Dec 26, 2010 at 6:29 PM, Black, Michael (IS)
> <Mi...@ngc.com> wrote:
>> I assume there's a way to make a specific # of splits and add each document to one of the splits...but I'll be darned if I can find the docs or an example to show this.
>
> Would CombineFileInputFormat and CombineFileSplit be what you're looking for?
>
> Doc links: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
> & http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/CombineFileSplit.html
>
>> As I said, I'm using hadoop-0.20.2, which I know makes a difference, as so many things get deprecated with each release. Old references don't seem to work.
>
> The API marked deprecated in 0.20.{0,1,2} has been un-deprecated in
> the 0.21.0 release and is also considered the "stable" API. You
> can continue using it, as it is still supported.
>
> (Maybe 0.20.3 will have them un-deprecated too; I'm not sure what the
> status on that is, although doing so would surely help avoid beginner
> confusion.)
> --
> Harsh J
> www.harshj.com
>



-- 
Lance Norskog
goksron@gmail.com

Re: Custom input split

Posted by Harsh J <qw...@gmail.com>.
Hi,

On Sun, Dec 26, 2010 at 6:29 PM, Black, Michael (IS)
<Mi...@ngc.com> wrote:
> I assume there's a way to make a specific # of splits and add each document to one of the splits...but I'll be darned if I can find the docs or an example to show this.

Would CombineFileInputFormat and CombineFileSplit be what you're looking for?

Doc links: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
& http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/CombineFileSplit.html
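
For the usual file-based case, a subclass can be quite small. An untested sketch (MyRecordReader is a stand-in for a RecordReader class you would write yourself):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

public class CombinedFormat extends CombineFileInputFormat<LongWritable, Text> {

        public CombinedFormat() {
                setMaxSplitSize(64 * 1024 * 1024); // pack inputs into ~64 MB splits
        }

        @SuppressWarnings("unchecked")
        public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
                        JobConf job, Reporter reporter) throws IOException {
                // CombineFileRecordReader feeds each file in the combined split
                // to a fresh MyRecordReader instance, one at a time.
                return new CombineFileRecordReader<LongWritable, Text>(job,
                                (CombineFileSplit) split, reporter,
                                (Class) MyRecordReader.class);
        }
}

Since your splits come from document IDs rather than files, though, a custom InputSplit that carries a batch of IDs may end up being the simpler route.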

> As I said, I'm using hadoop-0.20.2, which I know makes a difference, as so many things get deprecated with each release. Old references don't seem to work.

The API marked deprecated in 0.20.{0,1,2} has been un-deprecated in
the 0.21.0 release and is also considered the "stable" API. You
can continue using it, as it is still supported.

(Maybe 0.20.3 will have them un-deprecated too; I'm not sure what the
status on that is, although doing so would surely help avoid beginner
confusion.)
-- 
Harsh J
www.harshj.com

Re: Custom input split

Posted by "Black, Michael (IS)" <Mi...@ngc.com>.
You mean the file is "not trusted". I was using Outlook, and my company automatically puts a digital certificate on all emails; I'm using webmail right now, which doesn't. That certificate is installed by default on all company computers, so mail looks trusted to us without our having to explicitly trust the certificate.
 
I don't think my split problem has anything to do with the Lucene index...that was just informational.
 
Here's my getSplits()...it calls other functions which aren't important to the problem at hand:
 
        public List<InputSplit> getSplits(JobContext context) throws IOException,
                        InterruptedException {
                Configuration conf = context.getConfiguration();
                List<InputSplit> splits = new ArrayList<InputSplit>();
                Indexer indexer = new Indexer(conf.get(Config.Index), true);
                Iterator<Document> iDocument = indexer.iterator();
                int ndocs = 20; // limit the # of docs for testing -- got over 100,000 of these
                int added = 0;
                while (iDocument.hasNext() && added < ndocs) {
                        Document document = iDocument.next();
                        String docid = document.getId();
                        System.out.println("Adding ID " + docid);
                        splits.add(new PInputSplit(docid)); // one split -- and one map task -- per document
                        added++;
                }
                indexer.close();
                return splits;
        }

I assume there's a way to make a specific # of splits and add each document to one of the splits...but I'll be darned if I can find the docs or an example to show this.
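
Something like this is what I'm picturing inside getSplits() -- an untested sketch, assuming PInputSplit (or a variant of it) could be changed to take a list of IDs instead of a single one:

                int numSplits = 8; // e.g. one per core -- purely illustrative
                List<List<String>> groups = new ArrayList<List<String>>();
                for (int i = 0; i < numSplits; i++) {
                        groups.add(new ArrayList<String>());
                }
                // deal the document IDs out round-robin across the groups
                int n = 0;
                while (iDocument.hasNext()) {
                        groups.get(n++ % numSplits).add(iDocument.next().getId());
                }
                // one split -- and therefore one map task -- per non-empty group
                for (List<String> group : groups) {
                        if (!group.isEmpty()) {
                                splits.add(new PInputSplit(group)); // hypothetical list-taking constructor
                        }
                }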
 
As I said, I'm using hadoop-0.20.2, which I know makes a difference, as so many things get deprecated with each release. Old references don't seem to work.
 
Michael D. Black
Senior Scientist
Advanced Analytics Directorate
Northrop Grumman Information Systems
 

________________________________

From: 蔡超 [mailto:toppiprc@gmail.com]
Sent: Sat 12/25/2010 10:32 AM
To: common-user@hadoop.apache.org
Subject: EXTERNAL:Re: Custom input split



What is the file you have attached? It does not look safe.

I don't know the format of a Lucene index; would you please give an example?





Re: Custom input split

Posted by 蔡超 <to...@gmail.com>.
What is the file you have attached? It does not look safe.

I don't know the format of a Lucene index; would you please give an example?


On Sat, Dec 25, 2010 at 12:34 AM, Black, Michael (IS) <
Michael.Black2@ngc.com> wrote:

> Using hadoop-0.20
>
>
> I'm doing custom input splits from a Lucene index.
>
> I want to split the document IDs across N mappers (I'm testing the
> scalability of the problem across 4 nodes and 8 cores).
>
> So the key is the document number, and the numbers are not sequential.
>
> At this point I'm using splits.add to add each document...but that sets up
> one task for every document...not something I want to do, of course.
>
> How can I add a group of documents to each split?  I found a scant reference
> to PrimeInputSplit but that doesn't seem to resolve on hadoop-0.20.
>
>
> Michael D. Black
> Senior Scientist
> Northrop Grumman Information Systems
> Advanced Analytics Directorate
>
>
>
>