
Posted to dev@parquet.apache.org by Manik Singla <sm...@gmail.com> on 2019/06/11 12:28:14 UTC

bloomfilter and tokenisation

Hey Team

I have started using Parquet recently.

The kind of data I save looks something like this:

*raw   hostname cluster serviceName  *

where raw holds the actual log lines.

For raw, dictionary encoding doesn't help, as no two log lines are the
same. But if we tokenise the text into terms before building the
dictionary, then the dictionary can help filter out unwanted rows. For
example, "parquet is a columnar format" would become
"parquet", "is", "a", "columnar", "format".
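
To make the idea concrete, here is a minimal plain-Python sketch (it does not
use any Parquet API). It assumes naive whitespace tokenisation and treats each
group of rows as having its own set of distinct tokens, standing in for a
per-row-group dictionary. The names tokenise, build_groups and candidate_rows
are hypothetical and exist only for this illustration.

```python
from typing import List, Set, Tuple

Group = Tuple[List[str], Set[str]]


def tokenise(line: str) -> List[str]:
    """Split a log line into lowercase terms on whitespace (an assumed, naive tokeniser)."""
    return line.lower().split()


def build_groups(lines: List[str], group_size: int) -> List[Group]:
    """Chunk lines into groups and record each group's set of distinct tokens."""
    groups: List[Group] = []
    for start in range(0, len(lines), group_size):
        chunk = lines[start:start + group_size]
        distinct = {tok for line in chunk for tok in tokenise(line)}
        groups.append((chunk, distinct))
    return groups


def candidate_rows(groups: List[Group], term: str) -> List[str]:
    """Return matching lines, scanning only groups whose token set contains the term."""
    term = term.lower()
    hits: List[str] = []
    for chunk, distinct in groups:
        if term not in distinct:
            continue  # the whole group is skipped without looking at its rows
        hits.extend(line for line in chunk if term in tokenise(line))
    return hits


lines = [
    "parquet is a columnar format",
    "connection reset by peer",
    "disk usage above threshold",
    "request timed out",
]
groups = build_groups(lines, group_size=2)
print(tokenise(lines[0]))
print(candidate_rows(groups, "columnar"))
```

With whole lines as values, every entry in the set would be unique and the
membership check would rarely rule anything out; with tokens, a group that
never mentions the term can be skipped as a unit.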

Also, I see mention of merging bloomfilter, but I am not sure whether
tokenisation is being considered there.

Do we support some out-of-the-box way to tokenise text before dictionary
encoding?

Also, what are your views on adding it?

Regards
Manik Singla
+91-9996008893
+91-9665639677

"Life doesn't consist in holding good cards but playing those you hold
well."

Re: bloomfilter and tokenisation

Posted by Wes McKinney <we...@gmail.com>.

Hi Manik,

You could store "raw" as a LIST<BYTE_ARRAY> (so you have to tokenize
in your ETL step) instead of BYTE_ARRAY and you then reap dictionary
encoding benefits.

- Wes
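
A rough illustration of why this helps, in plain Python rather than any
Parquet library: once each log line is a list of tokens, the distinct tokens
form a small dictionary and every row is stored as indices into it. The
function dictionary_encode below is hypothetical and only mimics the idea of
dictionary encoding; it is not how a Parquet writer is implemented, and the
whitespace split is an assumed stand-in for the ETL tokenising step.

```python
from typing import Dict, List, Tuple


def dictionary_encode(rows: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Replace each token with its index in a dictionary of distinct tokens."""
    index_of: Dict[str, int] = {}
    dictionary: List[str] = []
    encoded: List[List[int]] = []
    for tokens in rows:
        row_ids: List[int] = []
        for tok in tokens:
            if tok not in index_of:
                # First time this token is seen: append it to the dictionary.
                index_of[tok] = len(dictionary)
                dictionary.append(tok)
            row_ids.append(index_of[tok])
        encoded.append(row_ids)
    return dictionary, encoded


# Tokenising happens in the ETL step, before the data is written.
raw_lines = [
    "request to service timed out",
    "request to service succeeded",
    "request to cache timed out",
]
rows = [line.split() for line in raw_lines]
dictionary, encoded = dictionary_encode(rows)

print(dictionary)
print(encoded)
```

In this toy input the 14 tokens collapse to 7 dictionary entries, whereas the
three whole lines are all distinct and would gain nothing from a dictionary.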

On Wed, Jun 12, 2019 at 12:08 PM Manik Singla <sm...@gmail.com> wrote:
>
> Could someone guide me on this one?

Re: bloomfilter and tokenisation

Posted by Manik Singla <sm...@gmail.com>.

Could someone guide me on this one?

Regards
Manik Singla
