Posted to dev@parquet.apache.org by Manik Singla <sm...@gmail.com> on 2019/06/11 12:28:14 UTC
bloomfilter and tokenisation
Hey Team
I have started using Parquet recently.
The kind of data I save looks something like
*raw hostname cluster serviceName *
where raw holds the actual log lines.
For raw, dictionary encoding doesn't help, as no two log lines are the same. But if we
tokenise the text into terms before building the dictionary, then the dictionary can help
to filter out unwanted rows. For example, "parquet is a columnar format" would become
"parquet", "is", "a", "columnar", "format".
Also, I see mention of merging bloomfilter, but I am not sure whether we are considering
tokenisation there.
Do we support some out-of-the-box way to tokenise text before dictionary encoding?
Also, what are your views on adding it?
Regards
Manik Singla
+91-9996008893
+91-9665639677
"Life doesn't consist in holding good cards but playing those you hold
well."
Re: bloomfilter and tokenisation
Posted by Wes McKinney <we...@gmail.com>.
Hi Manik,
You could store "raw" as a LIST<BYTE_ARRAY> (so you have to tokenize
in your ETL step) instead of BYTE_ARRAY and you then reap dictionary
encoding benefits.
- Wes
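A minimal sketch of why tokenising in the ETL step helps, using the example sentence from the original question. It is plain Python and does not call any Parquet library: the whitespace tokeniser, the sample log lines, and the helper names tokenise and may_contain are all assumptions made for illustration. It only mimics what a dictionary would contain for an untokenised column versus a list-of-terms column.

```python
# Illustrative only: plain Python, no Parquet library involved.
# It mimics the set of distinct values a dictionary would hold.

def tokenise(line):
    """Lower-case and split on whitespace (a simplistic, assumed tokeniser)."""
    return line.lower().split()

# Hypothetical sample log lines; no two lines are identical.
raw_lines = [
    "parquet is a columnar format",
    "parquet is a file format",
    "orc is a columnar format",
]

# Untokenised: every value is distinct, so a dictionary saves nothing.
whole_line_dictionary = set(raw_lines)

# Tokenised: each row becomes a list of terms, as with LIST<BYTE_ARRAY>.
token_rows = [tokenise(line) for line in raw_lines]
token_dictionary = {tok for row in token_rows for tok in row}

def may_contain(dictionary, term):
    """A chunk of rows can be skipped when the term is not in its dictionary."""
    return term.lower() in dictionary

total_tokens = sum(len(row) for row in token_rows)
print(len(whole_line_dictionary), "distinct lines for", len(raw_lines), "rows")
print(len(token_dictionary), "distinct tokens for", total_tokens, "token values")
print(may_contain(token_dictionary, "columnar"))
print(may_contain(token_dictionary, "avro"))
```

With whole lines, the number of distinct values equals the number of rows, while the tokenised column repeats a small set of terms. Whether a particular reader actually uses the dictionary to skip rows for list columns depends on the implementation and is not shown here.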
Re: bloomfilter and tokenisation
Posted by Manik Singla <sm...@gmail.com>.
Could someone guide me on this one?
Regards
Manik Singla
+91-9996008893
+91-9665639677
"Life doesn't consist in holding good cards but playing those you hold
well."