You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by franca perrina <f....@gmail.com> on 2017/05/05 16:03:58 UTC

Re: how to use LIKE operator with a binary column

Hello,

Thanks for your answers.

My payload is not encoded in UTF8, it can contain some non printable
characters, new lines and it may contain bytes that are not valid in UTF8.
An the latter should be my case.

I have tried with regexp_matches:

SELECT * FROM  `dfs`.`myfile.avro` WHERE regexp_matches(payload,
'(?s).*abcd.*');

but I have the same problem, and I have the same error if I do, obviously,

SELECT CAST(payload as VARCHAR) FROM `dfs`.`myfile.avro`;

So, I implemented an UDF function to convert the bytes into a hex encoded
string


public class AsciiStringBinaryFunc {

 // Converts a varbinary type into a hex encoded string.
 // (byte[]) {(byte)0xca, (byte)0xfe, (byte)0xba, (byte)0xbe}  =>
"\xca\xfe\xba\xbe"
 @FunctionTemplate(name = "ascii_string_binary", scope =
FunctionScope.SIMPLE, nulls = NullHandling.NULL_IF_NULL)
 public static class StringBinary implements DrillSimpleFunc {
   @Param VarBinaryHolder in;
   @Output VarCharHolder   out;
   @Workspace Charset charset;
   @Inject DrillBuf buffer;

   @Override
   public void setup() {
     charset = java.nio.charset.Charset.forName("US-ASCII");
   }

   @Override
   public void eval() {
     byte[] buf =
org.apache.drill.common.util.DrillStringUtils.toBinaryString(in.buffer,
in.start, in.end).getBytes(charset);
     buffer.setBytes(0, buf);
     buffer.setIndex(0, buf.length);

     out.start = 0;
     out.end = buf.length;
     out.buffer = buffer;
   }
 }
}

but then, I have a new problem

SELECT ascii_string_binary(payload) FROM `dfs`.`myfile.avro` LIMIT 1;

Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: 3484
(expected: range(0, 256))

Fragment 0:0

[Error Id: d0ab90d6-8b2a-4200-8809-534138c217fb on maprdemo:31010]
(state=,code=0)


knowing that

SELECT length(payload) FROM `dfs`.`myfile.avro` LIMIT 1;

+---------+
| EXPR$0  |
+---------+
| 3484    |
+---------+




Thanks a lot for your help,
Franca

On Sat, Apr 29, 2017 at 12:13 AM, Jinfeng Ni <jn...@apache.org> wrote:

> The error seems to indicated 'PAYLOAD' does not contain UTF8-encoded
> bytes. The like function is a string function, and it only accepts
> varchar/char type, which assumes inputs are UTF8 bytes.
>
> You may consider implementing a Drill UDF 'blike" which works similar
> to string function 'like', but could operate on non-UTF8 bytes.
>
> On Fri, Apr 28, 2017 at 3:02 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
> >  Hi Franca,
> >
> >     This issue is specific to the “bytes” type; for other Avro types the
> LIKE clause matches the printed representation, like:
> >
> > select * from dfs.`/data/avro/twitter.snappy.avro` where `timestamp`
> like '%66%';
> > +-------------+--------------------------------------+-------------+
> > |  username   |                tweet                 |  timestamp  |
> > +-------------+--------------------------------------+-------------+
> > | miguno      | Rock: Nerf paper, scissors is fine.  | 1366150681  |
> > | BlizzardCS  | Works as intended.  Terran is IMBA.  | 1366154481  |
> > +-------------+--------------------------------------+-------------+
> >
> > Can you share some sample avro file with “bytes” type?  (I couldn’t find
> any such sample online)     Maybe we’ll need to open a Jira for this case …
> >
> >      Thanks,
> >
> >              -- Boaz
> >
> > On 4/25/17, 8:45 AM, "franca perrina" <f....@gmail.com> wrote:
> >
> >     Hi,
> >
> >     I would like to use Drill to query data formatted in avro.
> >
> >     My avro schema looks like
> >
> >     ..
> >     {"name":"payload",
> >       "type":"bytes"}
> >     ..
> >
> >     and the result to the query
> >
> >     SELECT payload FROM `dfs`.`myfile.avro` LIMIT 1
> >
> >     looks like:
> >
> >     +-----------------+
> >     |     payload     |
> >     +-----------------+
> >     | [B@3b8e004e     |
> >     +-----------------+
> >
> >
> >     My problem is that when I run a query like:
> >
> >     SELECT * FROM  `dfs`.`myfile.avro` WHERE `PAYLOAD` LIKE '%abcd%'
> >
> >     then I have
> >     org.apache.drill.common.exceptions.UserRemoteException: SYSTEM
> ERROR:
> >     DrillRuntimeException: Unexpected byte 0xfd at position 1008556
> encountered
> >     while decoding UTF8 string. Fragment 0:0 [Error Id:
> >     0c247c14-0e51-402c-ad9a-411cbc445597
> >     on maprdemo:31010]
> >
> >     It seems like drill tries to decode the payload's bytes to UTF8.
> >
> >     What I would need is a grep like behaviour, where my payload data is
> >     considered as is, i.e. binary data, and it is not converted to a
> string
> >     data type.
> >
> >     Thanks a lot for your help.
> >     franca
> >
> >
>

Re: how to use LIKE operator with a binary column

Posted by franca perrina <f....@gmail.com>.
I found the way to resize the buffer

            @Override
   public void eval() {

     byte[] buf =
org.apache.drill.common.util.DrillStringUtils.toBinaryString(in.buffer,
in.start, in.end).getBytes(charset);
*      out.buffer = buffer = buffer.reallocIfNeeded(buf.length);*
...
   }

Thanks a lot for you help.

Franca

On Fri, May 5, 2017 at 6:03 PM, franca perrina <f....@gmail.com> wrote:

> Hello,
>
> Thanks for your answers.
>
> My payload is not encoded in UTF8, it can contain some non printable
> characters, new lines and it may contain bytes that are not valid in UTF8.
> An the latter should be my case.
>
> I have tried with regexp_matches:
>
> SELECT * FROM  `dfs`.`myfile.avro` WHERE regexp_matches(payload,
> '(?s).*abcd.*');
>
> but I have the same problem, and I have the same error if I do,
> obviously,
>
> SELECT CAST(payload as VARCHAR) FROM `dfs`.`myfile.avro`;
>
> So, I implemented an UDF function to convert the bytes into a hex encoded
> string
>
>
> public class AsciiStringBinaryFunc {
>
>  // Converts a varbinary type into a hex encoded string.
>  // (byte[]) {(byte)0xca, (byte)0xfe, (byte)0xba, (byte)0xbe}  =>
> "\xca\xfe\xba\xbe"
>  @FunctionTemplate(name = "ascii_string_binary", scope =
> FunctionScope.SIMPLE, nulls = NullHandling.NULL_IF_NULL)
>  public static class StringBinary implements DrillSimpleFunc {
>    @Param VarBinaryHolder in;
>    @Output VarCharHolder   out;
>    @Workspace Charset charset;
>    @Inject DrillBuf buffer;
>
>    @Override
>    public void setup() {
>      charset = java.nio.charset.Charset.forName("US-ASCII");
>    }
>
>    @Override
>    public void eval() {
>      byte[] buf = org.apache.drill.common.util.DrillStringUtils.toBinaryString(in.buffer,
> in.start, in.end).getBytes(charset);
>      buffer.setBytes(0, buf);
>      buffer.setIndex(0, buf.length);
>
>      out.start = 0;
>      out.end = buf.length;
>      out.buffer = buffer;
>    }
>  }
> }
>
> but then, I have a new problem
>
> SELECT ascii_string_binary(payload) FROM `dfs`.`myfile.avro` LIMIT 1;
>
> Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: 3484
> (expected: range(0, 256))
>
> Fragment 0:0
>
> [Error Id: d0ab90d6-8b2a-4200-8809-534138c217fb on maprdemo:31010]
> (state=,code=0)
>
>
> knowing that
>
> SELECT length(payload) FROM `dfs`.`myfile.avro` LIMIT 1;
>
> +---------+
> | EXPR$0  |
> +---------+
> | 3484    |
> +---------+
>
>
>
>
> Thanks a lot for your help,
> Franca
>
> On Sat, Apr 29, 2017 at 12:13 AM, Jinfeng Ni <jn...@apache.org> wrote:
>
>> The error seems to indicated 'PAYLOAD' does not contain UTF8-encoded
>> bytes. The like function is a string function, and it only accepts
>> varchar/char type, which assumes inputs are UTF8 bytes.
>>
>> You may consider implementing a Drill UDF 'blike" which works similar
>> to string function 'like', but could operate on non-UTF8 bytes.
>>
>> On Fri, Apr 28, 2017 at 3:02 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
>> >  Hi Franca,
>> >
>> >     This issue is specific to the “bytes” type; for other Avro types
>> the LIKE clause matches the printed representation, like:
>> >
>> > select * from dfs.`/data/avro/twitter.snappy.avro` where `timestamp`
>> like '%66%';
>> > +-------------+--------------------------------------+-------------+
>> > |  username   |                tweet                 |  timestamp  |
>> > +-------------+--------------------------------------+-------------+
>> > | miguno      | Rock: Nerf paper, scissors is fine.  | 1366150681  |
>> > | BlizzardCS  | Works as intended.  Terran is IMBA.  | 1366154481  |
>> > +-------------+--------------------------------------+-------------+
>> >
>> > Can you share some sample avro file with “bytes” type?  (I couldn’t
>> find any such sample online)     Maybe we’ll need to open a Jira for this
>> case …
>> >
>> >      Thanks,
>> >
>> >              -- Boaz
>> >
>> > On 4/25/17, 8:45 AM, "franca perrina" <f....@gmail.com> wrote:
>> >
>> >     Hi,
>> >
>> >     I would like to use Drill to query data formatted in avro.
>> >
>> >     My avro schema looks like
>> >
>> >     ..
>> >     {"name":"payload",
>> >       "type":"bytes"}
>> >     ..
>> >
>> >     and the result to the query
>> >
>> >     SELECT payload FROM `dfs`.`myfile.avro` LIMIT 1
>> >
>> >     looks like:
>> >
>> >     +-----------------+
>> >     |     payload     |
>> >     +-----------------+
>> >     | [B@3b8e004e     |
>> >     +-----------------+
>> >
>> >
>> >     My problem is that when I run a query like:
>> >
>> >     SELECT * FROM  `dfs`.`myfile.avro` WHERE `PAYLOAD` LIKE '%abcd%'
>> >
>> >     then I have
>> >     org.apache.drill.common.exceptions.UserRemoteException: SYSTEM
>> ERROR:
>> >     DrillRuntimeException: Unexpected byte 0xfd at position 1008556
>> encountered
>> >     while decoding UTF8 string. Fragment 0:0 [Error Id:
>> >     0c247c14-0e51-402c-ad9a-411cbc445597
>> >     on maprdemo:31010]
>> >
>> >     It seems like drill tries to decode the payload's bytes to UTF8.
>> >
>> >     What I would need is a grep like behaviour, where my payload data is
>> >     considered as is, i.e. binary data, and it is not converted to a
>> string
>> >     data type.
>> >
>> >     Thanks a lot for your help.
>> >     franca
>> >
>> >
>>
>
>