Posted to user@hive.apache.org by Ali Safdar Kureishy <sa...@gmail.com> on 2012/05/05 22:05:53 UTC

Problem writing SerDe to read Nutch crawldb because Hive seems to ignore the Key and only reads the Value from SequenceFiles.

Hi,

I have attached a Sequence file with the following format:
<url:Text> <data:CrawlDatum>

(CrawlDatum is a custom Java type that contains several fields, which would
be flattened into separate columns by the SerDe.)

In other words, I would like to expose this URL+CrawlDatum data via a Hive
external table with the following columns:
|| url || status || fetchtime || fetchinterval || modifiedtime || retries
|| score || metadata ||

So, I was hoping that after defining a custom SerDe, I would just have to
define the Hive table as follows:

CREATE EXTERNAL TABLE crawldb
(url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP<STRING,STRING>)
ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
STORED AS SEQUENCEFILE
LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';

For example, a sample record should look like the following through a Hive
table:
|| http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 ||
1 || 0.98 || {x=1,y=2,p=3,q=4} ||

I would like this to be possible without having to duplicate/flatten the
data through a separate transformation. Initially, I thought my custom
SerDe could have the following definition for deserialize():

        @Override
        public Object deserialize(Writable obj) throws SerDeException {
            ...
        }

But the problem is that the input argument obj above is only the VALUE
portion of a Sequence record. There seems to be a limitation in the way
Hive reads Sequence files: for each row in a sequence file, the KEY is
ignored and only the VALUE is used by Hive. This can be seen in the
org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow() method below,
which ignores the KEY when iterating over a RecordReader:

  /**
   * Get the next row. The fetch context is modified appropriately.
   *
   **/
  public InspectableObject getNextRow() throws IOException {
    try {
      while (true) {
        if (currRecReader == null) {
          currRecReader = getRecordReader();
          if (currRecReader == null) {
            return null;
          }
        }

        boolean ret = currRecReader.next(key, value);
        if (ret) {
          if (this.currPart == null) {
            Object obj = serde.deserialize(value);
            return new InspectableObject(obj, serde.getObjectInspector());
          } else {
            rowWithPart[0] = serde.deserialize(value);
            return new InspectableObject(rowWithPart, rowObjectInspector);
          }
        } else {
          currRecReader.close();
          currRecReader = null;
        }
      }
    } catch (Exception e) {
      throw new IOException(e);
    }
  }

As you can see, the "key" variable is ignored and never returned. The
problem is that in the Nutch crawldb Sequence File, the KEY is the URL, and
I need it to be displayed in the Hive table along with the fields of
CrawlDatum. But when writing the custom SerDe, I only see the CrawlDatum
that comes after the key on each record...which is not sufficient.

One hack could be to write a CustomSequenceFileRecordReader.java that
returns the offset in the sequence file as the KEY, and an aggregation of
the (Key+Value) as the VALUE. For that, perhaps I need to hack the code
below from SequenceFileRecordReader, which will get really messy:
  protected synchronized boolean next(K key)
    throws IOException {
    if (!more) return false;
    long pos = in.getPosition();
    boolean remaining = (in.next(key) != null);
    if (pos >= end && in.syncSeen()) {
      more = false;
    } else {
      more = remaining;
    }
    return more;
  }

This would require me to write a CustomSequenceFileRecordReader and a
CustomSequenceFileInputFormat and then some custom SerDe, and probably make
several other changes as well. Is it possible to just get away with writing
a custom SerDe and some pre-existing reader that includes the key when
invoking SerDe.deserialize()? Unless I'm missing something, why does Hive
have this limitation when accessing Sequence files? I would imagine that
the key of a sequence file record would be just as important as the
value...so why is it left out by the FetchOperator::getNextRow() method?

If this is the unfortunate reality with reading Nutch sequence files in
Hive, is there another Hive storage format I should use that works around
this limitation? Such as "create external table ..... STORED AS
CUSTOM_SEQUENCEFILE"? Or, let's say I write my own
CustomHiveSequenceFileInputFormat, how do I register it with Hive and use
it in the Hive "STORED AS" definition?

Any help or pointers would be greatly appreciated. I hope I'm mistaken
about the limitation above, and if not, hopefully there is an easy way to
resolve this through a custom SerDe alone.

Warm regards,
Safdar

Re: Problem writing SerDe to read Nutch crawldb because Hive seems to ignore the Key and only reads the Value from SequenceFiles.

Posted by Ali Safdar Kureishy <sa...@gmail.com>.
Also, if I return a fully formatted string containing all the flattened
values from my key+value (such as what you suggested), then I'd need to
split the resulting string into its component columns based on the
delimiter ("," or ";" or "\t" etc). How do I define the right table for
that?

In other words, my custom input format will return a value string of this
form:
<Text>;<cd.status>;<cd.fetchTime>;<cd.retries>;<cd.map>;.....

And so, on the Hive side, I'd like to use a ";" as the delimiter. Typically
this Hive table would be defined as:

CREATE TABLE crawldb (.....)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
....
....

Would I now be able to define my table the same way, using my custom input
format:
CREATE TABLE crawldb (...)
INPUTFORMAT 'MyFlatteningInputFormat'
FIELDS TERMINATED BY ';'
LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
?
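
(For reference, a hedged sketch of how such a definition might actually be
spelled in HiveQL: the input format class goes under STORED AS INPUTFORMAT
together with an output format class, while the delimiter stays under ROW
FORMAT DELIMITED. The fully qualified name com.example.MyFlatteningInputFormat
is a hypothetical stand-in for the class above, and LOCATION points at the
crawldb directory rather than a single part file:)

CREATE EXTERNAL TABLE crawldb
(url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
modifiedtime BIGINT, retries INT, score FLOAT, metadata STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
STORED AS
INPUTFORMAT 'com.example.MyFlatteningInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/training/deepcrawl/crawldb/current';

With the input format emitting ';'-delimited Text values, the LazySimpleSerDe
behind ROW FORMAT DELIMITED should be able to split each record into the
columns above.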

Thanks,
Safdar

On Sun, May 6, 2012 at 4:34 AM, Ali Safdar Kureishy <
safdar.kureishy@gmail.com> wrote:

> Thanks Edward.
>
> What are the Input and Output formats chosen by Hive for the "STORED
> AS SEQUENCEFILE" selection? And if I want to add my own syntactic
> sugar, is there a lookup mechanism where I can register my custom code
> so that it would work with "STORED AS MYCUSTOMSEQUENCEFILE"?
>
> Thanks,
> Safdar
>
>
> On Sun, May 6, 2012 at 1:16 AM, Edward Capriolo <ed...@gmail.com>
> wrote:
> > Stored as sequence file is syntax sugar. It sets both the inputformat and
> > outputformat.
> >
> > Create table x (thing int)
> > Inputformat 'class.x'
> > Outputformat 'class.y'
> >
> > For inputformat you can use your custom.
> >
> > For your output format you can stick with hive's
> ignorekeytextoutputformat
> > or ignorekeysequencefile format.
> >
> > To avoid having to write a serde your inputformat could also Chang the
> types
> > and format to something hive could easily recognize.
> >
> >
> > On Saturday, May 5, 2012, Ali Safdar Kureishy <safdar.kureishy@gmail.com
> >
> > wrote:
> >> Thanks Edward...I feared this was going to be the case.
> >> If I define a new input format, how do I use it in a hive table
> >> definition?
> >> For the SequenceFileInputFormat, the table definition would read as
> >> "...STORED AS SEQUENCEFILE".
> >> With the new one, how do I specify it in the definition? "STORED AS
> >> 'com.xyz.abc.MyInputFormat'?
> >> Thanks,
> >> Safdar
> >>
> >> On Sat, May 5, 2012 at 2:44 PM, Edward Capriolo <ed...@gmail.com>
> >> wrote:
> >>
> >> This is one of the things about hive the key is not easily available.
> >> You are going to need an input format that creates a new value which
> >> is contains the key and the value.
> >>
> >> Like this:
> >> <url:Text> <data:CrawlDatum> -> <null-writable>  new
> >> MyKeyValue<<url:Text> <data:CrawlDatum>>
> >>
> >>
> >> On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy
> >> <sa...@gmail.com> wrote:
> >>> Hi,
> >>>
> >>> I have attached a Sequence file with the following format:
> >>> <url:Text> <data:CrawlDatum>
> >>>
> >>> (CrawlDatum is a custom Java type, that contains several fields that
> >>> would
> >>> be flattened into several columns by the SerDe).
> >>>
> >>> In other words, what I would like to do, is to expose this
> URL+CrawlDatum
> >>> data via a Hive External table, with the following columns:
> >>> || url || status || fetchtime || fetchinterval || modifiedtime ||
> retries
> >>> ||
> >>> score || metadata ||
> >>>
> >>> So, I was hoping that after defining a custom SerDe, I would just have
> to
> >>> define the Hive table as follows:
> >>>
> >>> CREATE EXTERNAL TABLE crawldb
> >>> (url STRING, status STRING, fetchtime LONG, fetchinterval LONG,
> >>> modifiedtime
> >>> LONG, retries INT, score FLOAT, metadata MAP)
> >>> ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
> >>> STORED AS SEQUENCEFILE
> >>> LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
> >>>
> >>> For example, a sample record should like like the following through a
> >>> Hive
> >>> table:
> >>> || http://www.cnn.com || FETCHED || 125355734857 || 36000 ||
> 12453775834
> >>> ||
> >>> 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
> >>>
> >>> I would like this to be possible without having to duplicate/flatten
> the
> >>> data through a separate transformation. Initially, I thought my custom
> >>> SerDe
> >>> could have following definition for serialize():
> >>>
> >>>         @override
> >>> public Object deserialize(Writable obj) throws SerDeException {
> >>>             ...
> >>>          }
> >>>
> >>> But the problem is that the input argument obj above is only the
> >>> VALUE portion of a Sequence record. There seems to be a limitation with
> >>> the
> >>> way Hive reads Sequence files. Specifically, for each row in a sequence
> >>> file, the KEY is ignored and only the VALUE is used by Hive. This is
> seen
> >>> from the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow()
> >>> method
> >>> below, which ignores the KEY when iterating over a RecordReader (see
> bold
> >>> text below from the corresponding Hive code for
> >>> FetchOperator::getNextRow()):
> >>>
> >>>   /**
> >>>    * Get the next row. The fetch context is modified appropriately.
> >>>    *
> >>>    **/
> >>>   public InspectableObject getNextRow() throws IOException {
> >>>     try {
> >>>       while (true) {
> >>>         if (currRecReader == null) {
> >>>           currRecReader = getRecordReader();
> >>>           if (currRecReader == null) {
> >>>             return null;
> >>>           }
> >>>         }
> >>>
> >>>         boolean ret = currRecReader.next(key, value);
> >>>         if (ret) {
> >>>           if (this.currPart == null) {
> >>>             Object obj = serde.deserialize(value);
> >>>             return new InspectableObject(obj,
> >>> serde.getObjectInspector());
> >>>           } else {
> >>>             rowWithPart[0] = serde.deserialize(value);
> >>>             return new InspectableObject(rowWithPart,
> >>> rowObjectInspector);
> >>>           }
> >>>         } else {
> >>>           currRecReader.close();
> >>>           currRecReader = null;
> >>>         }
> >>>       }
> >>>     } catch (Exception e) {
> >>>       throw new IOException(e);
> >>>     }
> >>>   }
> >>>
> >>> As you can see, the "key" variable is ignored and never returned. The
> >>> problem is that in the Nutch crawldb Sequence File, the KEY is the URL,
> >>> and
> >>> I need it to be displayed in the Hive table along with the fields of
> >>> CrawlDatum. But when writing the the custom SerDe, I only see the
> >>> CrawlDatum
> >>> that comes after the key, on each record...which is not sufficient.
> >>>
> >>> One hack could be to write a CustomSequenceFileRecordReader.java that
> >>> returns the offset in the sequence file as the KEY, and an aggregation
> of
> >>> the (Key+Value) as th
>

Re: Problem writing SerDe to read Nutch crawldb because Hive seems to ignore the Key and only reads the Value from SequenceFiles.

Posted by Ali Safdar Kureishy <sa...@gmail.com>.
Thanks Edward.

What are the Input and Output formats chosen by Hive for the "STORED
AS SEQUENCEFILE" selection? And if I want to add my own syntactic
sugar, is there a lookup mechanism where I can register my custom code
so that it would work with "STORED AS MYCUSTOMSEQUENCEFILE"?

Thanks,
Safdar


On Sun, May 6, 2012 at 1:16 AM, Edward Capriolo <ed...@gmail.com> wrote:
> Stored as sequence file is syntax sugar. It sets both the inputformat and
> outputformat.
>
> Create table x (thing int)
> Inputformat 'class.x'
> Outputformat 'class.y'
>
> For inputformat you can use your custom.
>
> For your output format you can stick with hive's ignorekeytextoutputformat
> or ignorekeysequencefile format.
>
> To avoid having to write a serde your inputformat could also Chang the types
> and format to something hive could easily recognize.
>
>
> On Saturday, May 5, 2012, Ali Safdar Kureishy <sa...@gmail.com>
> wrote:
>> Thanks Edward...I feared this was going to be the case.
>> If I define a new input format, how do I use it in a hive table
>> definition?
>> For the SequenceFileInputFormat, the table definition would read as
>> "...STORED AS SEQUENCEFILE".
>> With the new one, how do I specify it in the definition? "STORED AS
>> 'com.xyz.abc.MyInputFormat'?
>> Thanks,
>> Safdar
>>
>> On Sat, May 5, 2012 at 2:44 PM, Edward Capriolo <ed...@gmail.com>
>> wrote:
>>
>> This is one of the things about hive the key is not easily available.
>> You are going to need an input format that creates a new value which
>> is contains the key and the value.
>>
>> Like this:
>> <url:Text> <data:CrawlDatum> -> <null-writable>  new
>> MyKeyValue<<url:Text> <data:CrawlDatum>>
>>
>>
>> On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy
>> <sa...@gmail.com> wrote:
>>> Hi,
>>>
>>> I have attached a Sequence file with the following format:
>>> <url:Text> <data:CrawlDatum>
>>>
>>> (CrawlDatum is a custom Java type, that contains several fields that
>>> would
>>> be flattened into several columns by the SerDe).
>>>
>>> In other words, what I would like to do, is to expose this URL+CrawlDatum
>>> data via a Hive External table, with the following columns:
>>> || url || status || fetchtime || fetchinterval || modifiedtime || retries
>>> ||
>>> score || metadata ||
>>>
>>> So, I was hoping that after defining a custom SerDe, I would just have to
>>> define the Hive table as follows:
>>>
>>> CREATE EXTERNAL TABLE crawldb
>>> (url STRING, status STRING, fetchtime LONG, fetchinterval LONG,
>>> modifiedtime
>>> LONG, retries INT, score FLOAT, metadata MAP)
>>> ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
>>> STORED AS SEQUENCEFILE
>>> LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
>>>
>>> For example, a sample record should like like the following through a
>>> Hive
>>> table:
>>> || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834
>>> ||
>>> 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
>>>
>>> I would like this to be possible without having to duplicate/flatten the
>>> data through a separate transformation. Initially, I thought my custom
>>> SerDe
>>> could have following definition for serialize():
>>>
>>>         @override
>>> public Object deserialize(Writable obj) throws SerDeException {
>>>             ...
>>>          }
>>>
>>> But the problem is that the input argument obj above is only the
>>> VALUE portion of a Sequence record. There seems to be a limitation with
>>> the
>>> way Hive reads Sequence files. Specifically, for each row in a sequence
>>> file, the KEY is ignored and only the VALUE is used by Hive. This is seen
>>> from the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow()
>>> method
>>> below, which ignores the KEY when iterating over a RecordReader (see bold
>>> text below from the corresponding Hive code for
>>> FetchOperator::getNextRow()):
>>>
>>>   /**
>>>    * Get the next row. The fetch context is modified appropriately.
>>>    *
>>>    **/
>>>   public InspectableObject getNextRow() throws IOException {
>>>     try {
>>>       while (true) {
>>>         if (currRecReader == null) {
>>>           currRecReader = getRecordReader();
>>>           if (currRecReader == null) {
>>>             return null;
>>>           }
>>>         }
>>>
>>>         boolean ret = currRecReader.next(key, value);
>>>         if (ret) {
>>>           if (this.currPart == null) {
>>>             Object obj = serde.deserialize(value);
>>>             return new InspectableObject(obj,
>>> serde.getObjectInspector());
>>>           } else {
>>>             rowWithPart[0] = serde.deserialize(value);
>>>             return new InspectableObject(rowWithPart,
>>> rowObjectInspector);
>>>           }
>>>         } else {
>>>           currRecReader.close();
>>>           currRecReader = null;
>>>         }
>>>       }
>>>     } catch (Exception e) {
>>>       throw new IOException(e);
>>>     }
>>>   }
>>>
>>> As you can see, the "key" variable is ignored and never returned. The
>>> problem is that in the Nutch crawldb Sequence File, the KEY is the URL,
>>> and
>>> I need it to be displayed in the Hive table along with the fields of
>>> CrawlDatum. But when writing the the custom SerDe, I only see the
>>> CrawlDatum
>>> that comes after the key, on each record...which is not sufficient.
>>>
>>> One hack could be to write a CustomSequenceFileRecordReader.java that
>>> returns the offset in the sequence file as the KEY, and an aggregation of
>>> the (Key+Value) as th

Re: Problem writing SerDe to read Nutch crawldb because Hive seems to ignore the Key and only reads the Value from SequenceFiles.

Posted by Edward Capriolo <ed...@gmail.com>.
Stored as sequence file is syntax sugar. It sets both the inputformat and
outputformat.

Create table x (thing int)
stored as
inputformat 'class.x'
outputformat 'class.y'

For the inputformat you can use your custom class.

For your output format you can stick with Hive's HiveIgnoreKeyTextOutputFormat
or HiveSequenceFileOutputFormat.

To avoid having to write a serde, your inputformat could also change the
types and format to something Hive could easily recognize.
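
(For reference, a hedged illustration of the desugared form; the two class
names below are what Hive appears to substitute for "STORED AS SEQUENCEFILE",
so treat them as an assumption to verify against your Hive version, e.g. with
DESCRIBE EXTENDED on an existing table:)

CREATE TABLE x (thing INT)
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat';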

On Saturday, May 5, 2012, Ali Safdar Kureishy <sa...@gmail.com>
wrote:
> Thanks Edward...I feared this was going to be the case.
> If I define a new input format, how do I use it in a hive table
definition?
> For the SequenceFileInputFormat, the table definition would read as
"...STORED AS SEQUENCEFILE".
> With the new one, how do I specify it in the definition? "STORED AS
'com.xyz.abc.MyInputFormat'?
> Thanks,
> Safdar
>
> On Sat, May 5, 2012 at 2:44 PM, Edward Capriolo <ed...@gmail.com>
wrote:
>
> This is one of the things about hive the key is not easily available.
> You are going to need an input format that creates a new value which
> is contains the key and the value.
>
> Like this:
> <url:Text> <data:CrawlDatum> -> <null-writable>  new
> MyKeyValue<<url:Text> <data:CrawlDatum>>
>
>
> On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy
> <sa...@gmail.com> wrote:
>> Hi,
>>
>> I have attached a Sequence file with the following format:
>> <url:Text> <data:CrawlDatum>
>>
>> (CrawlDatum is a custom Java type, that contains several fields that
would
>> be flattened into several columns by the SerDe).
>>
>> In other words, what I would like to do, is to expose this URL+CrawlDatum
>> data via a Hive External table, with the following columns:
>> || url || status || fetchtime || fetchinterval || modifiedtime ||
retries ||
>> score || metadata ||
>>
>> So, I was hoping that after defining a custom SerDe, I would just have to
>> define the Hive table as follows:
>>
>> CREATE EXTERNAL TABLE crawldb
>> (url STRING, status STRING, fetchtime LONG, fetchinterval LONG,
modifiedtime
>> LONG, retries INT, score FLOAT, metadata MAP)
>> ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
>> STORED AS SEQUENCEFILE
>> LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
>>
>> For example, a sample record should like like the following through a
Hive
>> table:
>> || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834
||
>> 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
>>
>> I would like this to be possible without having to duplicate/flatten the
>> data through a separate transformation. Initially, I thought my custom
SerDe
>> could have following definition for serialize():
>>
>>         @override
>> public Object deserialize(Writable obj) throws SerDeException {
>>             ...
>>          }
>>
>> But the problem is that the input argument obj above is only the
>> VALUE portion of a Sequence record. There seems to be a limitation with
the
>> way Hive reads Sequence files. Specifically, for each row in a sequence
>> file, the KEY is ignored and only the VALUE is used by Hive. This is seen
>> from the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow()
method
>> below, which ignores the KEY when iterating over a RecordReader (see bold
>> text below from the corresponding Hive code for
>> FetchOperator::getNextRow()):
>>
>>   /**
>>    * Get the next row. The fetch context is modified appropriately.
>>    *
>>    **/
>>   public InspectableObject getNextRow() throws IOException {
>>     try {
>>       while (true) {
>>         if (currRecReader == null) {
>>           currRecReader = getRecordReader();
>>           if (currRecReader == null) {
>>             return null;
>>           }
>>         }
>>
>>         boolean ret = currRecReader.next(key, value);
>>         if (ret) {
>>           if (this.currPart == null) {
>>             Object obj = serde.deserialize(value);
>>             return new InspectableObject(obj,
serde.getObjectInspector());
>>           } else {
>>             rowWithPart[0] = serde.deserialize(value);
>>             return new InspectableObject(rowWithPart,
rowObjectInspector);
>>           }
>>         } else {
>>           currRecReader.close();
>>           currRecReader = null;
>>         }
>>       }
>>     } catch (Exception e) {
>>       throw new IOException(e);
>>     }
>>   }
>>
>> As you can see, the "key" variable is ignored and never returned. The
>> problem is that in the Nutch crawldb Sequence File, the KEY is the URL,
and
>> I need it to be displayed in the Hive table along with the fields of
>> CrawlDatum. But when writing the the custom SerDe, I only see the
CrawlDatum
>> that comes after the key, on each record...which is not sufficient.
>>
>> One hack could be to write a CustomSequenceFileRecordReader.java that
>> returns the offset in the sequence file as the KEY, and an aggregation of
>> the (Key+Value) as th

Re: Problem writing SerDe to read Nutch crawldb because Hive seems to ignore the Key and only reads the Value from SequenceFiles.

Posted by Ali Safdar Kureishy <sa...@gmail.com>.
Thanks Edward...I feared this was going to be the case.

If I define a new input format, how do I use it in a hive table definition?

For the SequenceFileInputFormat, the table definition would read as
"...STORED AS SEQUENCEFILE".
With the new one, how do I specify it in the definition? "STORED AS
'com.xyz.abc.MyInputFormat'"?

Thanks,
Safdar


On Sat, May 5, 2012 at 2:44 PM, Edward Capriolo <ed...@gmail.com>wrote:

> This is one of the things about hive the key is not easily available.
> You are going to need an input format that creates a new value which
> is contains the key and the value.
>
> Like this:
> <url:Text> <data:CrawlDatum> -> <null-writable>  new
> MyKeyValue<<url:Text> <data:CrawlDatum>>
>
>
> On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy
> <sa...@gmail.com> wrote:
> > Hi,
> >
> > I have attached a Sequence file with the following format:
> > <url:Text> <data:CrawlDatum>
> >
> > (CrawlDatum is a custom Java type, that contains several fields that
> would
> > be flattened into several columns by the SerDe).
> >
> > In other words, what I would like to do, is to expose this URL+CrawlDatum
> > data via a Hive External table, with the following columns:
> > || url || status || fetchtime || fetchinterval || modifiedtime ||
> retries ||
> > score || metadata ||
> >
> > So, I was hoping that after defining a custom SerDe, I would just have to
> > define the Hive table as follows:
> >
> > CREATE EXTERNAL TABLE crawldb
> > (url STRING, status STRING, fetchtime LONG, fetchinterval LONG,
> modifiedtime
> > LONG, retries INT, score FLOAT, metadata MAP)
> > ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
> > STORED AS SEQUENCEFILE
> > LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
> >
> > For example, a sample record should like like the following through a
> Hive
> > table:
> > || http://www.cnn.com || FETCHED || 125355734857 || 36000 ||
> 12453775834 ||
> > 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
> >
> > I would like this to be possible without having to duplicate/flatten the
> > data through a separate transformation. Initially, I thought my custom
> SerDe
> > could have following definition for serialize():
> >
> >         @override
> > public Object deserialize(Writable obj) throws SerDeException {
> >             ...
> >          }
> >
> > But the problem is that the input argument obj above is only the
> > VALUE portion of a Sequence record. There seems to be a limitation with
> the
> > way Hive reads Sequence files. Specifically, for each row in a sequence
> > file, the KEY is ignored and only the VALUE is used by Hive. This is seen
> > from the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow()
> method
> > below, which ignores the KEY when iterating over a RecordReader (see bold
> > text below from the corresponding Hive code for
> > FetchOperator::getNextRow()):
> >
> >   /**
> >    * Get the next row. The fetch context is modified appropriately.
> >    *
> >    **/
> >   public InspectableObject getNextRow() throws IOException {
> >     try {
> >       while (true) {
> >         if (currRecReader == null) {
> >           currRecReader = getRecordReader();
> >           if (currRecReader == null) {
> >             return null;
> >           }
> >         }
> >
> >         boolean ret = currRecReader.next(key, value);
> >         if (ret) {
> >           if (this.currPart == null) {
> >             Object obj = serde.deserialize(value);
> >             return new InspectableObject(obj,
> serde.getObjectInspector());
> >           } else {
> >             rowWithPart[0] = serde.deserialize(value);
> >             return new InspectableObject(rowWithPart,
> rowObjectInspector);
> >           }
> >         } else {
> >           currRecReader.close();
> >           currRecReader = null;
> >         }
> >       }
> >     } catch (Exception e) {
> >       throw new IOException(e);
> >     }
> >   }
> >
> > As you can see, the "key" variable is ignored and never returned. The
> > problem is that in the Nutch crawldb Sequence File, the KEY is the URL,
> and
> > I need it to be displayed in the Hive table along with the fields of
> > CrawlDatum. But when writing the the custom SerDe, I only see the
> CrawlDatum
> > that comes after the key, on each record...which is not sufficient.
> >
> > One hack could be to write a CustomSequenceFileRecordReader.java that
> > returns the offset in the sequence file as the KEY, and an aggregation of
> > the (Key+Value) as the VALUE. For that, perhaps I need to hack the code
> > below from SequenceFileRecordReader, which will get really very messy:
> >   protected synchronized boolean next(K key)
> >     throws IOException {
> >     if (!more) return false;
> >     long pos = in.getPosition();
> >     boolean remaining = (in.next(key) != null);
> >     if (pos >= end && in.syncSeen()) {
> >       more = false;
> >     } else {
> >       more = remaining;
> >     }
> >     return more;
> >   }
> >
> > This would require me to write a CustomSequenceFileRecordReader and a
> > CustomSequenceFileInputFormat and then some custom SerDe, and probably
> make
> > several other changes as well. Is it possible to just get away with
> writing
> > a custom SerDe and some pre-existing reader that includes the key when
> > invoking SerDe.deserialize()? Unless I'm missing something, why does Hive
> > have this limitation, when accessing Sequence files? I would imagine that
> > the key of a sequence file record would be just as important as the
> > value...so why is it left out by the FetchOperator:getNextRow() method?
> >
> > If this is the unfortunate reality with reading sequence files in Nutch,
> is
> > there another Hive storage format I should use that works around this
> > limitation? Such as "create external table ..... STORED AS
> > CUSTOM_SEQUENCEFILE"? Or, let's say I write my own
> > CustomHiveSequenceFileInputFormat, how do i register it with Hive and
> use it
> > in the Hive "STORED AS" definition?
> >
> > Any help or pointers would be greatly appreciated. I hope I'm mistaken
> about
> > the limitation above, and if not, hopefully there is an easy way to
> resolve
> > this through a custom SerDe alone.
> >
> > Warm regards,
> > Safdar
>

Re: Problem writing SerDe to read Nutch crawldb because Hive seems to ignore the Key and only reads the Value from SequenceFiles.

Posted by Edward Capriolo <ed...@gmail.com>.
This is one of the things about Hive: the key is not easily available.
You are going to need an input format that creates a new value which
contains both the key and the value.

Like this:
<url:Text> <data:CrawlDatum> -> <null-writable>  new
MyKeyValue<<url:Text> <data:CrawlDatum>>
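
(A hedged sketch of the pattern described above, not Edward's actual code:
an input format that delegates to the stock SequenceFileInputFormat and
re-emits every record as (NullWritable, Text), with the Text carrying both
the original key and the original value. The class name UrlDatumInputFormat
and the tab-joined layout are illustrative assumptions; a real implementation
might instead emit a ';'-delimited row for a ROW FORMAT DELIMITED table, or
a composite Writable paired with a custom SerDe:)

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class UrlDatumInputFormat extends FileInputFormat<NullWritable, Text> {

  @Override
  public RecordReader<NullWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {

    // Let the stock sequence-file reader do the real work of pulling
    // <url:Text, data:CrawlDatum> records out of the split.
    final RecordReader<Text, Writable> inner =
        new SequenceFileInputFormat<Text, Writable>()
            .getRecordReader(split, job, reporter);

    return new RecordReader<NullWritable, Text>() {
      private final Text url = inner.createKey();          // original KEY
      private final Writable datum = inner.createValue();  // original VALUE

      public boolean next(NullWritable key, Text value) throws IOException {
        if (!inner.next(url, datum)) {
          return false;
        }
        // Pack key and value into the single VALUE that Hive hands to the
        // SerDe; datum.toString() is used here only for illustration.
        value.set(url.toString() + "\t" + datum.toString());
        return true;
      }

      public NullWritable createKey() { return NullWritable.get(); }
      public Text createValue() { return new Text(); }
      public long getPos() throws IOException { return inner.getPos(); }
      public float getProgress() throws IOException { return inner.getProgress(); }
      public void close() throws IOException { inner.close(); }
    };
  }
}

For this to run, the Nutch jar (so CrawlDatum can be instantiated by the
reader) and the jar containing the input format would both need to be on
Hive's classpath, e.g. via ADD JAR.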


On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy
<sa...@gmail.com> wrote:
> Hi,
>
> I have attached a Sequence file with the following format:
> <url:Text> <data:CrawlDatum>
>
> (CrawlDatum is a custom Java type, that contains several fields that would
> be flattened into several columns by the SerDe).
>
> In other words, what I would like to do, is to expose this URL+CrawlDatum
> data via a Hive External table, with the following columns:
> || url || status || fetchtime || fetchinterval || modifiedtime || retries ||
> score || metadata ||
>
> So, I was hoping that after defining a custom SerDe, I would just have to
> define the Hive table as follows:
>
> CREATE EXTERNAL TABLE crawldb
> (url STRING, status STRING, fetchtime LONG, fetchinterval LONG, modifiedtime
> LONG, retries INT, score FLOAT, metadata MAP)
> ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
> STORED AS SEQUENCEFILE
> LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
>
> For example, a sample record should like like the following through a Hive
> table:
> || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 ||
> 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
>
> I would like this to be possible without having to duplicate/flatten the
> data through a separate transformation. Initially, I thought my custom SerDe
> could have following definition for serialize():
>
>         @override
> public Object deserialize(Writable obj) throws SerDeException {
>             ...
>          }
>
> But the problem is that the input argument obj above is only the
> VALUE portion of a Sequence record. There seems to be a limitation with the
> way Hive reads Sequence files. Specifically, for each row in a sequence
> file, the KEY is ignored and only the VALUE is used by Hive. This is seen
> from the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow() method
> below, which ignores the KEY when iterating over a RecordReader (see bold
> text below from the corresponding Hive code for
> FetchOperator::getNextRow()):
>
>   /**
>    * Get the next row. The fetch context is modified appropriately.
>    *
>    **/
>   public InspectableObject getNextRow() throws IOException {
>     try {
>       while (true) {
>         if (currRecReader == null) {
>           currRecReader = getRecordReader();
>           if (currRecReader == null) {
>             return null;
>           }
>         }
>
>         boolean ret = currRecReader.next(key, value);
>         if (ret) {
>           if (this.currPart == null) {
>             Object obj = serde.deserialize(value);
>             return new InspectableObject(obj, serde.getObjectInspector());
>           } else {
>             rowWithPart[0] = serde.deserialize(value);
>             return new InspectableObject(rowWithPart, rowObjectInspector);
>           }
>         } else {
>           currRecReader.close();
>           currRecReader = null;
>         }
>       }
>     } catch (Exception e) {
>       throw new IOException(e);
>     }
>   }
>
> As you can see, the "key" variable is ignored and never returned. The
> problem is that in the Nutch crawldb Sequence File, the KEY is the URL, and
> I need it to be displayed in the Hive table along with the fields of
> CrawlDatum. But when writing the the custom SerDe, I only see the CrawlDatum
> that comes after the key, on each record...which is not sufficient.
>
> One hack could be to write a CustomSequenceFileRecordReader.java that
> returns the offset in the sequence file as the KEY, and an aggregation of
> the (Key+Value) as the VALUE. For that, perhaps I need to hack the code
> below from SequenceFileRecordReader, which will get really very messy:
>   protected synchronized boolean next(K key)
>     throws IOException {
>     if (!more) return false;
>     long pos = in.getPosition();
>     boolean remaining = (in.next(key) != null);
>     if (pos >= end && in.syncSeen()) {
>       more = false;
>     } else {
>       more = remaining;
>     }
>     return more;
>   }
>
> This would require me to write a CustomSequenceFileRecordReader and a
> CustomSequenceFileInputFormat and then some custom SerDe, and probably make
> several other changes as well. Is it possible to just get away with writing
> a custom SerDe and some pre-existing reader that includes the key when
> invoking SerDe.deserialize()? Unless I'm missing something, why does Hive
> have this limitation, when accessing Sequence files? I would imagine that
> the key of a sequence file record would be just as important as the
> value...so why is it left out by the FetchOperator:getNextRow() method?
>
> If this is the unfortunate reality with reading sequence files in Nutch, is
> there another Hive storage format I should use that works around this
> limitation? Such as "create external table ..... STORED AS
> CUSTOM_SEQUENCEFILE"? Or, let's say I write my own
> CustomHiveSequenceFileInputFormat, how do i register it with Hive and use it
> in the Hive "STORED AS" definition?
>
> Any help or pointers would be greatly appreciated. I hope I'm mistaken about
> the limitation above, and if not, hopefully there is an easy way to resolve
> this through a custom SerDe alone.
>
> Warm regards,
> Safdar
