Posted to dev@asterixdb.apache.org by Mike Carey <dt...@gmail.com> on 2016/02/23 05:50:01 UTC

Re: Limitation of the current TweetParser

We should definitely not be pulling in a subset of fields at the entry
point - that's what the UDF is for (it can trim off or add or convert
fields) - agreed.  Why not have the out-of-the-box adaptor simply keep
all of the fields in their incoming form?  Maybe something we'd need for
extra credit would be - if the data is targeted at a dataset with "more
schema" than the incoming wide-open records - the ability to do
field-level type conversions at the point of entry into a dataset by
calling the appropriate constructors with the incoming string values?
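A minimal sketch of how that entry-point conversion might look, assuming
a hypothetical mapping from field names to target types (these class and
method names are illustrative, not actual AsterixDB APIs):

import java.time.OffsetDateTime;
import java.util.Map;

// Hypothetical sketch: cast incoming string field values to the types
// declared by the target dataset ("more schema") at the point of entry.
public final class EntryPointCaster {

    public enum FieldType { INT64, DOUBLE, DATETIME, STRING }

    // Invoke the constructor appropriate for the declared type.
    static Object cast(String raw, FieldType declared) {
        switch (declared) {
            case INT64:    return Long.parseLong(raw);
            case DOUBLE:   return Double.parseDouble(raw);
            case DATETIME: return OffsetDateTime.parse(raw);
            default:       return raw;  // STRING: keep incoming form
        }
    }

    // Fields the dataset's schema doesn't mention stay in incoming form,
    // so the out-of-the-box adaptor still keeps everything.
    public static Object castField(String name, String raw,
                                   Map<String, FieldType> schema) {
        FieldType declared = schema.get(name);
        return declared == null ? raw : cast(raw, declared);
    }
}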

On 2/22/16 4:46 PM, Jianfeng Jia wrote:
> Dear devs,
>
> The TwitterFeedAdapter is nice, but the internal TweetParser has some limitations:
> 1. We only pick a few JSON fields, e.g. the user, geolocation, and message fields. I need the place field, and there are other fields that other applications may be interested in as well.
> 2. The text fields always go through getNormalizedString(), which filters out non-ASCII chars - a big loss of information. Even English text contains emojis, which are not “normal”.
>
> Obviously we could add the entire Twitter structure to this parser, but I’m wondering whether the current one-to-one mapping between Adapter and Parser is the best design. The Twitter data itself changes over time, and there are many other interesting open data sources, e.g. Instagram, Facebook, Weibo, Reddit ….  Could we have a general approach that covers all of these?
>
> One idea is to have field-level JSON-to-ADM parsers (int, double, string, binary, point, time, polygon, …). Then, given a schema option passed through the Adapter, we could easily assemble the fields into one record. The schema option could be a mapping from the original JSON field names to ADM types, e.g. { “id”: int64, “user”: { “userid”: int64, … } }. That way we wouldn’t have to write a specific parser for each data source.
>
> Another thought is to hand over the JSON object as-is and rely on the user’s UDF to parse the data. Even in that case, users could selectively override the field parsers that differ from ours.
>
> Any thoughts?
>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>
>
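A minimal sketch of the field-level, schema-driven assembly proposed
above, for flat records only (all names are hypothetical; nested objects
and most ADM types are omitted for brevity):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the proposed field-level parser registry plus
// schema-driven record assembly, e.g. schema = { "id": "int64" }.
public final class SchemaDrivenAssembler {

    // One reusable parser per ADM type, shared across all data sources.
    private static final Map<String, Function<String, Object>> PARSERS =
            Map.of("int64",  Long::parseLong,
                   "double", Double::parseDouble,
                   "string", s -> s);

    // Assemble one record by applying each field's declared parser;
    // fields without a schema entry default to string.
    public static Map<String, Object> assemble(Map<String, String> json,
                                               Map<String, String> schema) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, String> f : json.entrySet()) {
            String admType = schema.getOrDefault(f.getKey(), "string");
            record.put(f.getKey(), PARSERS.get(admType).apply(f.getValue()));
        }
        return record;
    }
}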


Re: Limitation of the current TweetParser

Posted by Yingyi Bu <bu...@gmail.com>.
>> As for the cast-record, if we can add advanced type conversion, that
>> will be great.

I guess the flow could be: top-level JSON object (tuple) --> fully open
Asterix record --> record with the required type.
To change the cast-record function, you can take a look at the code here:
https://github.com/apache/incubator-asterixdb/tree/master/asterix-om/src/main/java/org/apache/asterix/om/pointables/cast
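Modeling records as plain maps, that flow might look roughly like this
simplified, hypothetical sketch (the real casting lives in the
pointables/cast package linked above):

import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch of: JSON object --> fully open record --> required type.
public final class CastFlowSketch {

    // Step 1: the fully open record keeps every incoming field verbatim.
    static Map<String, Object> toOpenRecord(Map<String, String> json) {
        return new LinkedHashMap<>(json);
    }

    // Step 2: cast to the required type; only int64 is shown for brevity,
    // fields outside the required schema stay in the open part.
    static Map<String, Object> castToRequired(Map<String, Object> open,
                                              Map<String, String> required) {
        Map<String, Object> out = new LinkedHashMap<>(open);
        required.forEach((field, type) -> {
            Object v = out.get(field);
            if ("int64".equals(type) && v instanceof String) {
                out.put(field, Long.parseLong((String) v));
            }
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> typed = castToRequired(
                toOpenRecord(Map.of("id", "42", "text", "hello")),
                Map.of("id", "int64"));
        System.out.println(typed);  // e.g. {id=42, text=hello}
    }
}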

Best,
Yingyi


On Mon, Feb 22, 2016 at 10:40 PM, Jianfeng Jia <ji...@gmail.com>
wrote:

> I’ve created issue ASTERIXDB-1318 <
> https://issues.apache.org/jira/browse/ASTERIXDB-1318> wrt recovering the
> missing fields from the Twitter Stream JSON.
>
> As for the cast-record, if we can add advanced type conversion, that will
> be great.
>
> [...]

Re: Limitation of the current TweetParser

Posted by Jianfeng Jia <ji...@gmail.com>.
I’ve created issue ASTERIXDB-1318 <https://issues.apache.org/jira/browse/ASTERIXDB-1318> wrt recovering the missing fields from the Twitter Stream JSON.

As for the cast-record, if we can add advanced type conversion, that will be great.

> On Feb 22, 2016, at 10:06 PM, Yingyi Bu <bu...@gmail.com> wrote:
> 
>> [...]
> 
> I guess we can have an enhanced version of the cast-record function to do
> that?  It already considers the combination of complex types,
> open-closeness, and type promotions.  Maybe we can enhance that with
> temporal/spatial constructors?
> 
> Best,
> Yingyi
> 


Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine


Re: Limitation of the current TweetParser

Posted by Yingyi Bu <bu...@gmail.com>.
>> Maybe something we'd need for extra credit would be - if the data is
>> targeted at a dataset with "more schema" than the incoming wide-open
>> records - the ability to do field-level type conversions at the point of
>> entry into a dataset by calling the appropriate constructors with the
>> incoming string values?

I guess we can have an enhanced version of the cast-record function to do
that?  It already considers the combination of complex types,
open-closeness, and type promotions.  Maybe we can enhance that with
temporal/spatial constructors?
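
For example, such an enhanced cast might route string values through
temporal/spatial constructors when the target field type calls for one;
a rough sketch with invented names (not the actual cast-record code):

import java.time.LocalDateTime;

// Hypothetical sketch: pick a temporal/spatial constructor by target type
// before falling back to the existing cast/type-promotion rules.
public final class ConstructorCast {

    public static Object castValue(String raw, String targetType) {
        switch (targetType) {
            case "datetime":
                // temporal constructor, e.g. "2016-02-22T22:06:00"
                return LocalDateTime.parse(raw);
            case "point":
                // spatial constructor, e.g. "33.64,-117.84" -> {x, y}
                String[] xy = raw.split(",");
                return new double[] { Double.parseDouble(xy[0].trim()),
                                      Double.parseDouble(xy[1].trim()) };
            default:
                return raw;  // leave to the existing cast-record logic
        }
    }
}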

Best,
Yingyi


On Mon, Feb 22, 2016 at 8:50 PM, Mike Carey <dt...@gmail.com> wrote:

> [...]