Posted to dev@nutch.apache.org by Matt Kangas <ka...@gmail.com> on 2006/01/06 18:11:29 UTC

creating MapFiles from unsorted data?

Hi folks,

I'm in the process of cleaning up my WhitelistURLFilter (NUTCH-87 on  
JIRA), and I've got a question about working with  
org.apache.nutch.io.MapFile.

I am parsing a textfile with one key/value pair per line. I want to  
write this into a new MapFile. MapFile.Writer requires keys to be  
added strictly in-order, so currently I:
- read the textfile into an ArrayList
- sort (in RAM)
- write the MapFile
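
For reference, the three steps above can be sketched roughly as follows. This is only an illustration: OrderedWriter is a hypothetical stand-in for MapFile.Writer (it just refuses keys that go backwards), and the tab-separated line format is an assumption, since the textfile format isn't specified here.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class InMemorySortSketch {

    /** Hypothetical stand-in for MapFile.Writer: rejects keys that go backwards. */
    static class OrderedWriter {
        private final List<Map.Entry<String, String>> written = new ArrayList<>();
        private String lastKey = null;

        void append(String key, String value) {
            if (lastKey != null && key.compareTo(lastKey) < 0) {
                throw new IllegalStateException("key out of order: " + key);
            }
            written.add(new SimpleEntry<>(key, value));
            lastKey = key;
        }

        List<Map.Entry<String, String>> entries() {
            return written;
        }
    }

    /** Parses "key<TAB>value" lines, sorts them by key in RAM, then appends in order. */
    static OrderedWriter load(List<String> lines) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip lines without a separator (assumed policy)
            }
            pairs.add(new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1)));
        }
        pairs.sort(Map.Entry.comparingByKey()); // the whole input must fit in memory
        OrderedWriter writer = new OrderedWriter();
        for (Map.Entry<String, String> pair : pairs) {
            writer.append(pair.getKey(), pair.getValue());
        }
        return writer;
    }

    public static void main(String[] args) {
        OrderedWriter writer = load(List.of("b\t2", "a\t1", "c\t3"));
        for (Map.Entry<String, String> entry : writer.entries()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

The sort is the step that holds every pair in memory at once, which is why this version is limited by available RAM.
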

Clearly this won't scale for a large textfile, so I'm changing it to
use a temporary SequenceFile instead. Then I'll sort the
SequenceFile, and copy item-by-item into the MapFile.

While I'm doing this, I'm wondering if there isn't a way to avoid the  
2nd copy.

Is there a way to create a MapFile from an already-sorted  
SequenceFile? Or, create an unsorted "data" file, sort it, then add  
the "index"? I didn't see anything in MapFile.* to permit this, but  
I'm probably missing something.

--Matt

--
Matt Kangas / kangas@gmail.com



Re: creating MapFiles from unsorted data?

Posted by Matt Kangas <ka...@gmail.com>.
Thanks for the quick feedback! I'll use the existing facilities to
finish NUTCH-87 for now. There's a good chance that I'll need to do
more work like this soon, though, and if so, I'll consider patching
MapFile.

--Matt

On Jan 6, 2006, at 2:12 PM, Doug Cutting wrote:

> Matt Kangas wrote:
>> Clearly this won't scale for a large textfile, so I'm changing it
>> to use a temporary SequenceFile instead. Then I'll sort the
>> SequenceFile, and copy item-by-item into the MapFile.
>> While I'm doing this, I'm wondering if there isn't a way to avoid
>> the 2nd copy.
>
> No, not presently.  So the cost of sorting becomes n*(log(n)+1),  
> which is to say, the 2nd copy will slow things, but not hugely.
>
>> Is there a way to create a MapFile from an already-sorted  
>> SequenceFile?
>
> No, but it wouldn't be too hard to add one.
>
>> Or, create an unsorted "data" file, sort it, then add the "index"?
>
> Right, that's the way I'd implement sorted-SequenceFile ->
> MapFile.  So MapFile might get a new public static method that:
>   1. Moves a sorted SequenceFile to <File>/data; and
>   2. Creates an index file in <File>/index.
>
> Note that this would still have to read the entire SequenceFile, so  
> all that's saved is re-writing it.
>
> Doug

--
Matt Kangas / kangas@gmail.com



Re: creating MapFiles from unsorted data?

Posted by Doug Cutting <cu...@nutch.org>.
Matt Kangas wrote:
> Clearly this won't scale for a large textfile, so I'm changing it to
> use a temporary SequenceFile instead. Then I'll sort the SequenceFile,
> and copy item-by-item into the MapFile.
> 
> While I'm doing this, I'm wondering if there isn't a way to avoid the  
> 2nd copy.

No, not presently.  So the cost of sorting becomes n*(log(n)+1), which 
is to say, the 2nd copy will slow things, but not hugely.
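
To make that arithmetic concrete, here is a rough worked version of the estimate. It assumes a sort that moves about n*log2(n) records and a final copy that moves n more; the base of the logarithm and the example value of n are illustrative assumptions, not measurements.

```latex
\begin{align*}
\text{sort} + \text{copy} &\approx n\log_2 n + n = n(\log_2 n + 1) \\
\frac{\text{copy}}{\text{sort}} &\approx \frac{n}{n\log_2 n} = \frac{1}{\log_2 n} \\
n = 10^6:\quad \log_2 n &\approx 19.9, \qquad \frac{1}{\log_2 n} \approx 0.05
\end{align*}
```

Under those assumptions the extra copy adds roughly 5% for a million records, and proportionally less as n grows.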

> Is there a way to create a MapFile from an already-sorted SequenceFile? 

No, but it wouldn't be too hard to add one.

> Or, create an unsorted "data" file, sort it, then add the "index"?

Right, that's the way I'd implement sorted-SequenceFile -> MapFile.  So 
MapFile might get a new public static method that:
   1. Moves a sorted SequenceFile to <File>/data; and
   2. Creates an index file in <File>/index.

Note that this would still have to read the entire SequenceFile, so all 
that's saved is re-writing it.
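
A minimal sketch of those two steps, to show the shape of the idea. It does not use the Nutch classes: the record layout (writeUTF key/value pairs), the index interval, and every name below are assumptions made up for illustration, and the real SequenceFile and MapFile formats differ.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

class SortedFileToMapDirSketch {
    static final int INDEX_INTERVAL = 2; // hypothetical: index every 2nd record

    /** Writes pairs in the given order; stands in for an already-sorted SequenceFile. */
    static void writeSorted(Path file, String[][] pairs) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            for (String[] kv : pairs) {
                out.writeUTF(kv[0]);
                out.writeUTF(kv[1]);
            }
        }
    }

    /** Step 1: move the sorted file to dir/data. Step 2: scan it once, write dir/index. */
    static void convert(Path sortedFile, Path dir) throws IOException {
        Files.createDirectories(dir);
        Path data = dir.resolve("data");
        Files.move(sortedFile, data); // no rewrite of the records themselves
        try (RandomAccessFile in = new RandomAccessFile(data.toFile(), "r");
             DataOutputStream index =
                 new DataOutputStream(Files.newOutputStream(dir.resolve("index")))) {
            String prev = null;
            for (long n = 0; in.getFilePointer() < in.length(); n++) {
                long offset = in.getFilePointer();
                String key = in.readUTF();
                in.readUTF(); // the value is read only to advance to the next record
                if (prev != null && key.compareTo(prev) < 0) {
                    throw new IOException("input is not sorted at key " + key);
                }
                if (n % INDEX_INTERVAL == 0) {
                    index.writeUTF(key);
                    index.writeLong(offset);
                }
                prev = key;
            }
        }
    }

    /** Loads the sparse index, seeks to the nearest indexed key, then scans forward. */
    static String get(Path dir, String wanted) throws IOException {
        TreeMap<String, Long> index = new TreeMap<>();
        try (DataInputStream in =
                 new DataInputStream(Files.newInputStream(dir.resolve("index")))) {
            while (true) {
                String key = in.readUTF();
                index.putIfAbsent(key, in.readLong());
            }
        } catch (EOFException endOfIndex) {
            // reached the end of the index file
        }
        Map.Entry<String, Long> start = index.floorEntry(wanted);
        if (start == null) {
            return null; // wanted sorts before the first key
        }
        try (RandomAccessFile in = new RandomAccessFile(dir.resolve("data").toFile(), "r")) {
            in.seek(start.getValue());
            while (in.getFilePointer() < in.length()) {
                String key = in.readUTF();
                String value = in.readUTF();
                int cmp = key.compareTo(wanted);
                if (cmp == 0) return value;
                if (cmp > 0) return null; // passed the slot where it would have been
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("mapdir-sketch");
        Path sorted = tmp.resolve("sorted.seq");
        writeSorted(sorted, new String[][] {{"a", "1"}, {"b", "2"}, {"c", "3"}});
        convert(sorted, tmp.resolve("map"));
        System.out.println(get(tmp.resolve("map"), "b"));
    }
}
```

The single loop in convert is the full read mentioned above: every record is still visited once to find its offset, but nothing is written apart from the small index.
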

Doug