Posted to user@geode.apache.org by Eric Pederson <er...@gmail.com> on 2015/07/20 21:57:14 UTC

Offline analytics for HDFS integration

In the spec for HDFS integration it says that data events are archived on
HDFS for offline analysis.  How do you do offline analysis?  Is there an
API for the file format so third-party tools can read it?  Or do you go
through an HDFS region?

Also, just curious, are you using an LSM-tree to structure the data?

Thanks,

-- Eric

Re: Offline analytics for HDFS integration

Posted by Anilkumar Gingade <ag...@pivotal.io>.
Hi Eric,

In case you haven't come across it: we have a GemFire Spark connector which
can be used to store/retrieve data from Spark.

https://issues.apache.org/jira/browse/GEODE-9
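
For reference, below is a minimal sketch (Scala, untested) of what reading a
region from Spark might look like with the connector. The package, the
spark.gemfire.locators property, and the gemfireRegion method follow the
connector's early docs and may differ by version; "exampleRegion" and the
locator address are placeholders.

    // Minimal sketch, not a tested example. Package, property, and method
    // names are assumptions based on the GEODE-9 connector documentation;
    // "exampleRegion" and the locator address are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}
    import io.pivotal.gemfire.spark.connector._   // assumed connector package

    val conf = new SparkConf()
      .setAppName("gemfire-spark-example")
      .set("spark.gemfire.locators", "localhost[10334]")   // assumed property
    val sc = new SparkContext(conf)

    // Expose the region as an RDD[(K, V)] and run ordinary Spark operations.
    val pairs = sc.gemfireRegion[String, String]("exampleRegion")   // assumed method
    println(pairs.count())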

Thanks,
-Anil.


On Wed, Jul 22, 2015 at 8:31 AM, Eric Pederson <er...@gmail.com> wrote:

> Hi Ashvin:
>
> We are using tools like Spark (and Hive for metadata) to process files in
> HDFS.  We're interested in both the GemFire RDD and the GemFire HDFS
> integration as ways to access the data we have in GemFire using Spark and
> potentially Drill or Impala.
>
> Thanks,
>
>
> -- Eric
>
> On Tue, Jul 21, 2015 at 1:35 AM, Ashvin A <aa...@gmail.com> wrote:
>
>> Hi Eric,
>>
>> Currently the HDFS store writes data in SequenceFile format and HFile
>> format. Each value is a serialized event which contains metadata and the
>> value provided by the user. The value can be deserialized using Geode
>> classes. Each file can be deserialized independently and does not depend on
>> a live Geode cluster. A user-level API to construct this data will be added
>> soon (see GFInputFormat as an example).
>>
>> HDFS can be used as an archive by means of write-only regions. These
>> regions do not follow an LSM-tree structure; the LSM structure is used for
>> read-write regions.
>>
>> I am planning to create a JIRA and provide more details. Meanwhile, can
>> you help us understand your use case? In your opinion, what could this
>> interface look like? What about old versions of a key? Do you prefer
>> accessing HDFS files directly, or is the HDFS region interface better? Any
>> other information relevant to the HDFS region data access pattern would
>> also help.
>>
>> Thanks
>> Ashvin
>>
>>
>>
>> On Mon, Jul 20, 2015 at 12:57 PM, Eric Pederson <er...@gmail.com>
>> wrote:
>>
>>> In the spec for HDFS integration it says that data events are archived
>>> on HDFS for offline analysis.  How do you do offline analysis?  Is there an
>>> API for the file format so third-party tools can read it?  Or do you go
>>> through an HDFS region?
>>>
>>> Also, just curious, are you using an LSM-tree to structure the data?
>>>
>>> Thanks,
>>>
>>> -- Eric
>>>
>>
>>
>

Re: Offline analytics for HDFS integration

Posted by Eric Pederson <er...@gmail.com>.
Hi Ashvin:

We are using tools like Spark (and Hive for metadata) to process files in
HDFS.  We're interested in both the GemFire RDD and the GemFire HDFS
integration as ways to access the data we have in GemFire using Spark and
potentially Drill or Impala.

Thanks,


-- Eric

On Tue, Jul 21, 2015 at 1:35 AM, Ashvin A <aa...@gmail.com> wrote:

> Hi Eric,
>
> Currently the HDFS store writes data in SequenceFile format and HFile
> format. Each value is a serialized event which contains metadata and the
> value provided by the user. The value can be deserialized using Geode
> classes. Each file can be deserialized independently and does not depend on
> a live Geode cluster. A user-level API to construct this data will be added
> soon (see GFInputFormat as an example).
>
> HDFS can be used as an archive by means of write-only regions. These
> regions do not follow an LSM-tree structure; the LSM structure is used for
> read-write regions.
>
> I am planning to create a JIRA and provide more details. Meanwhile, can
> you help us understand your use case? In your opinion, what could this
> interface look like? What about old versions of a key? Do you prefer
> accessing HDFS files directly, or is the HDFS region interface better? Any
> other information relevant to the HDFS region data access pattern would
> also help.
>
> Thanks
> Ashvin
>
>
>
> On Mon, Jul 20, 2015 at 12:57 PM, Eric Pederson <er...@gmail.com> wrote:
>
>> In the spec for HDFS integration it says that data events are archived on
>> HDFS for offline analysis.  How do you do offline analysis?  Is there an
>> API for the file format so third-party tools can read it?  Or do you go
>> through an HDFS region?
>>
>> Also, just curious, are you using an LSM-tree to structure the data?
>>
>> Thanks,
>>
>> -- Eric
>>
>
>

Re: Offline analytics for HDFS integration

Posted by Ashvin A <aa...@gmail.com>.
Hi Eric,

Currently the HDFS store writes data in SequenceFile format and HFile format.
Each value is a serialized event which contains metadata and the value
provided by the user. The value can be deserialized using Geode classes.
Each file can be deserialized independently and does not depend on a live
Geode cluster. A user-level API to construct this data will be added soon
(see GFInputFormat as an example).
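
To make that concrete until the user-level API lands, here is a rough sketch
(Scala, untested) of pulling the raw files into Spark. It assumes the
sequence files carry raw bytes for both key and value; the store path, the
writable types, and the final decoding step (which would use the Geode
classes mentioned above) are all assumptions/placeholders. Once GFInputFormat
is published, sc.newAPIHadoopRDD would be the more natural entry point.

    // Rough sketch, not a tested example. The path and the BytesWritable
    // key/value types are assumptions about how the HDFS store lays out its
    // sequence files; decoding each event would use the Geode classes
    // mentioned above.
    import org.apache.hadoop.io.BytesWritable
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hdfs-region-offline-read"))

    // Hypothetical directory the HDFS store wrote for one region.
    val path = "hdfs:///geode/exampleStore/exampleRegion/*"

    val raw = sc.sequenceFile(path, classOf[BytesWritable], classOf[BytesWritable])

    // Each value is a serialized event (metadata + user value); here we only
    // inspect the raw sizes, since decoding needs the Geode classes.
    raw.map { case (k, v) => (k.copyBytes(), v.copyBytes()) }
       .take(5)
       .foreach { case (k, v) => println(s"key: ${k.length} bytes, value: ${v.length} bytes") }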

HDFS can be used as an archive by means of write-only regions. These regions
do not follow an LSM-tree structure; the LSM structure is used for read-write
regions.

I am planning to create a JIRA and provide more details. Meanwhile, can you
help us understand your use case? In your opinion, what could this interface
look like? What about old versions of a key? Do you prefer accessing HDFS
files directly, or is the HDFS region interface better? Any other information
relevant to the HDFS region data access pattern would also help.

Thanks
Ashvin



On Mon, Jul 20, 2015 at 12:57 PM, Eric Pederson <er...@gmail.com> wrote:

> In the spec for HDFS integration it says that data events are archived on
> HDFS for offline analysis.  How do you do offline analysis?  Is there an
> API for the file format so third-party tools can read it?  Or do you go
> through an HDFS region?
>
> Also, just curious, are you using an LSM-tree to structure the data?
>
> Thanks,
>
> -- Eric
>