Posted to dev@nutch.apache.org by Matt Kangas <ka...@gmail.com> on 2006/01/06 18:11:29 UTC
creating MapFiles from unsorted data?
Hi folks,
I'm in the process of cleaning up my WhitelistURLFilter (NUTCH-87 on
JIRA), and I've got a question about working with
org.apache.nutch.io.MapFile.
I am parsing a textfile with one key/value pair per line. I want to
write this into a new MapFile. MapFile.Writer requires keys to be
added strictly in-order, so currently I:
- read the textfile into an ArrayList
- sort (in RAM)
- write the MapFile
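The three steps above can be sketched roughly as follows. This is only an illustration and does not use the Nutch API: the tab separator, the class name SortedPairs, and both method names are assumptions made for the example, and checkStrictlyAscending is a stand-in for the ordering check that MapFile.Writer performs on append.

```java
import java.util.ArrayList;
import java.util.List;

class SortedPairs {
    /** Parses "key<TAB>value" lines (separator is an assumption) and sorts them by key. */
    static List<String[]> parseAndSort(List<String> lines) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip lines that have no separator
            }
            pairs.add(new String[] { line.substring(0, tab), line.substring(tab + 1) });
        }
        // In-RAM sort: memory use grows with the size of the text file.
        pairs.sort((a, b) -> a[0].compareTo(b[0]));
        return pairs;
    }

    /** Stand-in for an order-checking writer: rejects any key not greater than the previous one. */
    static void checkStrictlyAscending(List<String[]> pairs) {
        String previous = null;
        for (String[] pair : pairs) {
            if (previous != null && pair[0].compareTo(previous) <= 0) {
                throw new IllegalStateException("key out of order: " + pair[0]);
            }
            previous = pair[0];
        }
    }
}
```

Duplicate keys would fail the ordering check in this sketch, since "strictly in-order" is read here as strictly ascending.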
Clearly this won't scale for a large textfile, so I'm changing it to
use a temporary SequenceFile instead. Then I'll sort the
SequenceFile, and copy item-by-item into the MapFile.
While I'm doing this, I'm wondering if there isn't a way to avoid the
2nd copy.
Is there a way to create a MapFile from an already-sorted
SequenceFile? Or, create an unsorted "data" file, sort it, then add
the "index"? I didn't see anything in MapFile.* to permit this, but
I'm probably missing something.
--Matt
--
Matt Kangas / kangas@gmail.com
Re: creating MapFiles from unsorted data?
Posted by Matt Kangas <ka...@gmail.com>.
Thanks for the quick feedback! I'll use the existing facilities to
finish NUTCH-87 for now. There's a good chance that I'll need to do
more stuff like this soon, though, and if so, I'll consider patching
MapFile.
--Matt
On Jan 6, 2006, at 2:12 PM, Doug Cutting wrote:
> Matt Kangas wrote:
>> Clearly this won't scale for a large textfile, so I'm changing it
>> to use a temporary SequenceFile instead. Then I'll sort the
>> SequenceFile, and copy item-by-item into the MapFile.
>> While I'm doing this, I'm wondering if there isn't a way to avoid
>> the 2nd copy.
>
> No, not presently. So the cost of sorting becomes n*(log(n)+1),
> which is to say, the 2nd copy will slow things, but not hugely.
>
>> Is there a way to create a MapFile from an already-sorted
>> SequenceFile?
>
> No, but it wouldn't be too hard to add one.
>
>> Or, create an unsorted "data" file, sort it, then add the "index"?
>
> Right, that's the way I'd implement sorted-SequenceFile ->
> MapFile. So MapFile might get a new public static method that:
> 1. Moves a sorted SequenceFile to <File>/data; and
> 2. Creates an index file in <File>/index.
>
> Note that this would still have to read the entire SequenceFile, so
> all that's saved is re-writing it.
>
> Doug
--
Matt Kangas / kangas@gmail.com
Re: creating MapFiles from unsorted data?
Posted by Doug Cutting <cu...@nutch.org>.
Matt Kangas wrote:
> Clearly this won't scale for a large textfile, so I'm changing it to
> use a temporary SequenceFile instead. Then I'll sort the SequenceFile,
> and copy item-by-item into the MapFile.
>
> While I'm doing this, I'm wondering if there isn't a way to avoid the
> 2nd copy.
No, not presently. So the cost of sorting becomes n*(log(n)+1), which
is to say, the 2nd copy will slow things, but not hugely.
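To put a rough number on "not hugely", here is the arithmetic under two assumptions that the message does not state: the logarithm is base 2, and copying one record costs about the same as one record-step of the sort.

```latex
\underbrace{n \log_2 n}_{\text{sort}} + \underbrace{n}_{\text{2nd copy}} = n(\log_2 n + 1),
\qquad
\frac{n}{n \log_2 n} = \frac{1}{\log_2 n}
% Example: n = 2^{20} (about one million records) gives \log_2 n = 20,
% so the 2nd copy adds roughly 1/20 = 5\% on top of the sort.
```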
> Is there a way to create a MapFile from an already-sorted SequenceFile?
No, but it wouldn't be too hard to add one.
> Or, create an unsorted "data" file, sort it, then add the "index"?
Right, that's the way I'd implement sorted-SequenceFile -> MapFile. So
MapFile might get a new public static method that:
1. Moves a sorted SequenceFile to <File>/data; and
2. Creates an index file in <File>/index.
Note that this would still have to read the entire SequenceFile, so all
that's saved is re-writing it.
Doug
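Step 2 of the proposal above (creating the index from data that is already sorted) might look something like the sketch below. It is not the Nutch API and not an existing MapFile method: SparseIndexSketch, Entry, buildIndex, and the interval parameter are all hypothetical, and the sketch assumes the index holds every interval-th key together with that record's position in the data file. It reads every record once and writes none of them back, which matches the point that only the re-writing is saved.

```java
import java.util.ArrayList;
import java.util.List;

class SparseIndexSketch {
    /** One index entry: a key and the position of its record in the data file. */
    static final class Entry {
        final String key;
        final long position;

        Entry(String key, long position) {
            this.key = key;
            this.position = position;
        }
    }

    /**
     * Makes a single pass over already-sorted records, where keys.get(i) is
     * stored at positions.get(i). Records every interval-th key and throws
     * if the keys are not strictly ascending.
     */
    static List<Entry> buildIndex(List<String> keys, List<Long> positions, int interval) {
        List<Entry> index = new ArrayList<>();
        String previous = null;
        for (int i = 0; i < keys.size(); i++) {
            String key = keys.get(i);
            if (previous != null && key.compareTo(previous) <= 0) {
                throw new IllegalArgumentException("data is not sorted at record " + i);
            }
            if (i % interval == 0) {
                index.add(new Entry(key, positions.get(i)));
            }
            previous = key;
        }
        return index;
    }
}
```

The sortedness check is included because an index built over unsorted data would silently give wrong lookups; a real implementation would read keys and positions from the SequenceFile reader instead of from in-memory lists.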