Posted to user@hadoop.apache.org by venito camelas <ro...@gmail.com> on 2016/07/07 23:40:40 UTC

Help designing application architecture

I'm pretty new to this and have a use case I'm not sure how to implement.
I'll try to explain it, and I'd appreciate it if anyone could point me in
the right direction.

The case has these requirements:
 1 - Any user should be able to define the format of the information they
want to store (a channel). For example, user X defines a channel named
"coordinate":
coordinate = {
"X" : "Float",
"Y" : "Float",
"instant" : "Timestamp"
}
  Every channel has some time value: either an instant (like above) or a
period of time ("start" : "Timestamp", "end" : "Timestamp").

 2 - Given the previous example, the user should be able to ask the
following questions:
2.1 When was the last time I went near {X : x, Y : y}?  --> Process the
information in order to get the "near" places and return the newest one.
2.2 Where was I on March 6th between 1pm and 2pm?       --> Query by time



For 1) I was thinking of using some document-oriented storage because of
the channels' lack of a fixed structure; I'm not sure that's the only thing
to consider, though.

For 2.1) I'd use a MapReduce (MR) job.
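
To make the 2.1 idea concrete, here is a minimal sketch in plain Java of the logic such a job might carry: a "map"-like step keeps only the points within some radius of the target, and a "reduce"-like step keeps the newest timestamp. The class names, the Euclidean distance, and the radius parameter are assumptions made for illustration; nothing here uses the Hadoop API, and "near" may need a different definition for real coordinates.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical record type mirroring the "coordinate" channel from the post.
final class CoordinateRecord {
    final double x;
    final double y;
    final Instant instant;

    CoordinateRecord(double x, double y, Instant instant) {
        this.x = x;
        this.y = y;
        this.instant = instant;
    }
}

final class LastTimeNear {
    // "Map" side: is this point within `radius` of the target? (Euclidean distance is an assumption.)
    static boolean isNear(CoordinateRecord p, double targetX, double targetY, double radius) {
        return Math.hypot(p.x - targetX, p.y - targetY) <= radius;
    }

    // "Reduce" side: among the near points, keep the newest instant (empty if none are near).
    static Optional<Instant> lastTimeNear(List<CoordinateRecord> points,
                                          double targetX, double targetY, double radius) {
        return points.stream()
                .filter(p -> isNear(p, targetX, targetY, radius))
                .map(p -> p.instant)
                .max(Comparator.naturalOrder());
    }

    public static void main(String[] args) {
        List<CoordinateRecord> points = Arrays.asList(
                new CoordinateRecord(1.0, 1.0, Instant.parse("2016-03-06T13:10:00Z")),
                new CoordinateRecord(1.2, 0.9, Instant.parse("2016-03-07T09:00:00Z")),
                new CoordinateRecord(9.0, 9.0, Instant.parse("2016-03-08T09:00:00Z")));
        System.out.println(lastTimeNear(points, 1.0, 1.0, 0.5));
    }
}
```

In an actual MR job the filter would presumably live in the mapper and the max in the reducer, but the per-record logic would look much the same.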

For 2.2) I think it would be better to have the information in the document
storage and make the queries there.
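
As a rough illustration of the 2.2 query, the sketch below uses an in-memory sorted map as a stand-in for a time-indexed store; a document store would need an index on the timestamp field to answer this efficiently. The class name `TimeIndex` and its methods are hypothetical, and the instants are assumed to be in UTC, which the post does not specify.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a store indexed by timestamp.
final class TimeIndex {
    // Sorted by instant, so a range query is a sub-map view rather than a full scan.
    // Simplification: one position per instant; a second put with the same instant overwrites.
    private final NavigableMap<Instant, double[]> byInstant = new TreeMap<>();

    void put(Instant instant, double x, double y) {
        byInstant.put(instant, new double[] {x, y});
    }

    // Returns the {x, y} pairs whose instant falls in [from, to], oldest first.
    List<double[]> between(Instant from, Instant to) {
        return new ArrayList<>(byInstant.subMap(from, true, to, true).values());
    }

    public static void main(String[] args) {
        TimeIndex index = new TimeIndex();
        index.put(Instant.parse("2016-03-06T12:59:00Z"), 0.0, 0.0);
        index.put(Instant.parse("2016-03-06T13:15:00Z"), 31.75, 32.75);
        index.put(Instant.parse("2016-03-06T14:30:00Z"), 5.0, 5.0);
        // "Where was I on March 6th between 1pm and 2pm?"
        for (double[] xy : index.between(Instant.parse("2016-03-06T13:00:00Z"),
                                         Instant.parse("2016-03-06T14:00:00Z"))) {
            System.out.println(xy[0] + ", " + xy[1]);
        }
    }
}
```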

Is it a good approach to have the information stored both in HDFS and in
the document-oriented storage (for processing and querying respectively)?

As I mentioned in the beginning, I'm really new to this and I'm just trying
to learn, so sorry if my questions are silly.

Any suggestion or any good reference related to this will be much
appreciated.

Re: Help designing application architecture

Posted by venito camelas <ro...@gmail.com>.
Sorry, but I did not understand.
From what I can see, case classes are a Scala feature, and I'm using Java
(I could consider learning Scala and switching to it, because I have not
started yet and it's for learning purposes only).

What do you mean by known formats? When a user creates a channel, they can
only choose from some basic types (string, long, timestamp, etc.) and from
channels they previously created. Example:

The user first creates 2 simple channels (Coordinate and Temperature):
Coordinate = {
"X" : "Float",
"Y" : "Float",
"instant" : "Timestamp"
}

Temperature = {
"value" : "Float",
"measurement_unit" : "String",
"instant" : "Timestamp"
}

Then, the user creates a new channel using the 2 previously created:
Measurement = {
"coord" : "Coordinate",
"temp" : "Temperature",
"instant" : "Timestamp"
}


Now, when data arrives, I validate its format against the defined
channel's format; if it doesn't match, I throw an error. Example:

{
"coord" : {
"X" : 31.75,
"Y" : "32.75",
"instant" : "2016-06-20T13:28:06.419Z"
},
"temp" : {
"value" : 25.6,
"measurement_unit" : "Celsius",
"instant" : "2016-06-20T13:28:06.419Z"
},
"instant" : "2016-06-20T13:28:06.419Z"
}

That piece of data will fail validation because the "Y" value isn't of
Float type (as defined in the Coordinate channel).
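
A minimal sketch of that validation in Java, assuming the incoming JSON has already been parsed into nested `Map`s and that any type name other than Float, String, or Timestamp refers to a previously defined channel. The class `ChannelValidator`, the `obj` helper, and the error strings are all hypothetical, and the check ignores extra fields that the channel does not declare.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ChannelValidator {
    // channel name -> (field name -> type name); a type is a basic type or another channel.
    private final Map<String, Map<String, String>> channels = new LinkedHashMap<>();

    // Fields are given as alternating name/type pairs, e.g. define("C", "X", "Float").
    void define(String channel, String... fieldsAndTypes) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < fieldsAndTypes.length; i += 2) {
            fields.put(fieldsAndTypes[i], fieldsAndTypes[i + 1]);
        }
        channels.put(channel, fields);
    }

    // Returns a list of problems; an empty list means the data matches the channel.
    List<String> validate(String channel, Map<String, ?> data) {
        List<String> errors = new ArrayList<>();
        check(channel, data, "", errors);
        return errors;
    }

    private void check(String channel, Map<String, ?> data, String path, List<String> errors) {
        Map<String, String> fields = channels.get(channel);
        if (fields == null) {
            errors.add(path + ": unknown channel " + channel);
            return;
        }
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String where = path + field.getKey();
            Object value = data.get(field.getKey());
            String type = field.getValue();
            if (value == null) {
                errors.add(where + ": missing");
            } else if (type.equals("Float")) {
                if (!(value instanceof Number)) errors.add(where + ": expected Float");
            } else if (type.equals("String")) {
                if (!(value instanceof String)) errors.add(where + ": expected String");
            } else if (type.equals("Timestamp")) {
                if (!isTimestamp(value)) errors.add(where + ": expected Timestamp");
            } else if (value instanceof Map) {
                // Any other type name is treated as a reference to another channel.
                @SuppressWarnings("unchecked")
                Map<String, ?> nested = (Map<String, ?>) value;
                check(type, nested, where + ".", errors);
            } else {
                errors.add(where + ": expected " + type);
            }
        }
    }

    // A Timestamp is assumed to be an ISO-8601 string such as 2016-06-20T13:28:06.419Z.
    private static boolean isTimestamp(Object value) {
        if (!(value instanceof String)) return false;
        try {
            Instant.parse((String) value);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Helper standing in for a JSON parser: builds a map from alternating key/value pairs.
    static Map<String, Object> obj(Object... keysAndValues) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i < keysAndValues.length; i += 2) {
            m.put((String) keysAndValues[i], keysAndValues[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        ChannelValidator v = new ChannelValidator();
        v.define("Coordinate", "X", "Float", "Y", "Float", "instant", "Timestamp");
        v.define("Temperature", "value", "Float", "measurement_unit", "String", "instant", "Timestamp");
        v.define("Measurement", "coord", "Coordinate", "temp", "Temperature", "instant", "Timestamp");
        String ts = "2016-06-20T13:28:06.419Z";
        Map<String, Object> data = obj(
                "coord", obj("X", 31.75, "Y", "32.75", "instant", ts),
                "temp", obj("value", 25.6, "measurement_unit", "Celsius", "instant", ts),
                "instant", ts);
        System.out.println(v.validate("Measurement", data)); // prints [coord.Y: expected Float]
    }
}
```

Collecting the problems in a list rather than throwing on the first one is a design choice of this sketch; throwing an error, as described above, would work just as well.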

Is there a chance you could explain a little more about what you said
previously? It would really help me.

Thank you

2016-07-07 20:54 GMT-03:00 Ted Yu <yu...@gmail.com>:

> For 1) you don't have to introduce external storage.
>
> You can define case classes for the known formats.
>
> FYI

Re: Help designing application architecture

Posted by Ted Yu <yu...@gmail.com>.
For 1) you don't have to introduce external storage.

You can define case classes for the known formats.

FYI
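
For readers working in Java rather than Scala, one possible reading of "case classes for the known formats" is a fixed, immutable value class per format. The sketch below is a rough Java analogue for the "coordinate" channel; the class and its field types are assumptions made for illustration, not something stated in this reply.

```java
import java.time.Instant;
import java.util.Objects;

// Hypothetical immutable value class, roughly what a Scala case class would provide.
final class Coordinate {
    final float x;
    final float y;
    final Instant instant;

    Coordinate(float x, float y, Instant instant) {
        this.x = x;
        this.y = y;
        this.instant = Objects.requireNonNull(instant, "instant");
    }

    // Value equality: two coordinates are equal when all fields are equal.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Coordinate)) return false;
        Coordinate other = (Coordinate) o;
        return Float.compare(x, other.x) == 0
                && Float.compare(y, other.y) == 0
                && instant.equals(other.instant);
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y, instant);
    }

    @Override
    public String toString() {
        return "Coordinate(" + x + "," + y + "," + instant + ")";
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2016-06-20T13:28:06.419Z");
        System.out.println(new Coordinate(31.75f, 32.75f, t).equals(new Coordinate(31.75f, 32.75f, t)));
    }
}
```

A class like this only fits formats that are known at compile time. Channels that users define at runtime would still need a runtime description of their fields.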
