Posted to user@drill.apache.org by PROJJWAL SAHA <pr...@gmail.com> on 2017/03/09 12:36:06 UTC

Query on .gz.parquet files

All,

I have one question.
I am querying .gz.parquet files.
select * from xxx returns data like:
+---------+
| current |
+---------+
| {"vendor_id":"VTS","pickup_datetime":"ACj75+tEAAAvfSUA","payment_type":"CSH","fare_amount":12.0,"mta_tax":0.5,"tip_amount":0.0,"tolls_amount":5.33,"total_amount":18.33,"ratecodeid":1.0,"dropoff_datetime":"AEhTi5NFAAAvfSUA","passenger_count":1,"trip_distance":2.93,"extra":0.5,"pickup_geocode":{"Latitude":40.743677,"Longitude":-73.953802},"dropoff_geocode":{"Latitude":40.740917,"Longitude":-73.989298},"PRIMARY_KEY":"8589934600","pickup_geocode_geo_city":"Long Island City","pickup_geocode_geo_country":"US","pickup_geocode_geo_postcode":"11109","pickup_geocode_geo_region":"New York","pickup_geocode_geo_subregion":"Queens County","pickup_geocode_geo_regionid":"5128638","pickup_geocode_geo_subregionid":"5133268","dropoff_geocode_geo_city":"New York City","dropoff_geocode_geo_country":"US","dropoff_geocode_geo_postcode":"10007","dropoff_geocode_geo_region":"New York","dropoff_geocode_geo_regionid":"5128638"} |.....

It doesn't return the data in tabular format with headers at the top.

Also, select count(*) works fine,
whereas select count(vendor_id) doesn't work: it returns 0.

It looks like the header names are not detected.

I have tried adding extractheaders: true for parquet.
I also tried adding gz.parquet to the extensions, but it doesn't work.

I also have defaultInputFormat set to parquet for the workspace.

Any suggestions?

Regards,
Projjwal

Re: Query on .gz.parquet files

Posted by Kunal Khatua <kk...@mapr.com>.
Is this a Parquet file? It looks more like a JSON document. What schema does parquet-tools report for it?
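One quick way to check whether the file really is Parquet, independent of its extension, is to look at its magic bytes. The sketch below is only an illustration and not part of Drill or parquet-tools: it assumes a local copy of the file, and the function name sniff_format is made up for this example. It relies on two facts: an unencrypted Parquet file begins and ends with the 4 bytes "PAR1", and a gzip stream begins with the bytes 0x1f 0x8b. With some writers the ".gz" in a ".gz.parquet" name only names the internal compression codec, so such a file would still be reported as "parquet" here.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"    # first and last 4 bytes of an unencrypted Parquet file
GZIP_MAGIC = b"\x1f\x8b"   # first 2 bytes of a gzip stream


def sniff_format(path):
    """Best-guess label for a file, based only on its leading/trailing bytes.

    Hypothetical helper for illustration; it does not validate the file.
    """
    data_path = Path(path)
    size = data_path.stat().st_size
    with data_path.open("rb") as f:
        head = f.read(4)
        tail = b""
        if size >= 8:
            f.seek(-4, 2)  # 2 = seek relative to the end of the file
            tail = f.read(4)
    if head == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "parquet"
    if head[:2] == GZIP_MAGIC:
        return "gzip"
    if head.lstrip()[:1] in (b"{", b"["):
        return "json-like text"
    return "unknown"
```

Calling sniff_format on one of the files in the workspace directory would return "gzip" for a gzip-wrapped file or "json-like text" for a plain JSON document, either of which would suggest the data is not being read through the Parquet reader.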


