Posted to dev@nutch.apache.org by Matt Pearson <mp...@lizearle.com> on 2009/01/05 16:32:22 UTC

nutch segment format

Hi Everyone,

I'm looking into reading data from Nutch segments with PHP. Is there
anywhere I can get information on the format in which the data is
stored?

Thanks, and apologies if this isn't the right place to ask this question.

Matt Pearson

Re: nutch segment format

Posted by Todd Lipcon <tl...@gmail.com>.
Hi Matt,

Nutch segments are stored as Hadoop SequenceFiles and MapFiles. A MapFile
is a directory holding two SequenceFiles: a sorted data file and an index
into it. I'm not certain whether the format is documented anywhere, but the
source is in org.apache.hadoop.io. I doubt you'll find a PHP library for
reading them, so you'll probably have to write something yourself.
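
For anyone writing such a reader, the fixed header at the start of a
SequenceFile is a natural first step. What follows is a minimal sketch in
Python (used here purely for illustration; the same logic ports to PHP). It
assumes the version 6 header layout implemented in
org.apache.hadoop.io.SequenceFile: the bytes "SEQ" plus a version byte, the
key and value class names as length-prefixed strings, two compression flags,
an optional codec class name, a metadata map, and a 16-byte sync marker. The
function names are hypothetical, and the layout should be checked against
the source of the Hadoop release that wrote the segment.

```python
import struct


def read_vint(stream):
    """Decode one Hadoop variable-length integer (as WritableUtils does)."""
    first = struct.unpack("b", stream.read(1))[0]  # signed first byte
    if first >= -112:
        return first  # values from -112 to 127 fit in a single byte
    negative = first < -120
    # The first byte encodes the sign and how many payload bytes follow.
    size = (-120 - first) if negative else (-112 - first)
    value = int.from_bytes(stream.read(size), "big")
    return ~value if negative else value


def read_string(stream):
    """Read a vint length followed by that many UTF-8 bytes."""
    length = read_vint(stream)
    return stream.read(length).decode("utf-8")


def read_header(stream):
    """Parse a SequenceFile header, assuming the version 6 layout."""
    magic = stream.read(4)
    if magic[:3] != b"SEQ":
        raise ValueError("not a SequenceFile: bad magic bytes")
    version = magic[3]
    if version != 6:
        # Assumption: only version 6 is handled in this sketch.
        raise ValueError("unsupported SequenceFile version %d" % version)

    header = {"version": version}
    header["key_class"] = read_string(stream)
    header["value_class"] = read_string(stream)
    header["compressed"] = stream.read(1) != b"\x00"
    header["block_compressed"] = stream.read(1) != b"\x00"
    # The codec class name is present only when compression is on.
    header["codec"] = read_string(stream) if header["compressed"] else None

    # Metadata: a big-endian int count, then that many key/value strings.
    (count,) = struct.unpack(">i", stream.read(4))
    metadata = {}
    for _ in range(count):
        key = read_string(stream)
        metadata[key] = read_string(stream)
    header["metadata"] = metadata

    # The sync marker recurs between records so readers can resynchronize.
    header["sync"] = stream.read(16)
    return header
```

Called on a file opened in binary mode, read_header reports which Writable
classes the records contain. Decoding the records themselves would still
mean porting the readFields logic of each of those classes.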

-Todd

On Mon, Jan 5, 2009 at 10:32 AM, Matt Pearson <mp...@lizearle.com> wrote:

> Hi Everyone,
>
> I'm looking into reading data from Nutch segments with PHP. Is there
> anywhere I can get information on the format in which the data is
> stored?
>
> Thanks, and apologies if this isn't the right place to ask this question.
>
> Matt Pearson