Posted to user@drill.apache.org by Mathieu Agneray <ma...@gmail.com> on 2015/09/17 15:43:01 UTC

CSV Bug

Hi,

I'm having an issue with a Drill file format.
I have a CSV file (apache2 web server logs) that uses a space as the
delimiter and double quotes around text fields.
So I have configured my csv file format like this:

"csv": {
  "type": "text",
  "extensions": [
    "csv"
  ],
  "escape": "\\",
  "comment": "\u0000",
  "delimiter": " "
}

but the lines are not split correctly.

A line looks like this:
XXX.XXX.XXX.XXX 200 "GET / ... etc" "USER AGENT"

Instead of giving me this (4 columns):
["XXX.XXX.XXX.XXX", "200", "GET / ... etc", "USER AGENT"]

I get this response (3 columns):
["XXX.XXX.XXX.XXX", "200", "GET / ... etc\" \"USER AGENT\""]

But if I switch to a comma delimiter in both the file and the
configuration, it works fine.
Is there a problem in the code with the space delimiter?

Thanks

Mathieu Agneray
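
For reference, the expected 4-column split can be reproduced outside Drill. The sketch below uses Python's standard csv module, not Drill's text reader, with a space delimiter, a double-quote quote character and a backslash escape to mirror the configuration above. The function name split_log_line is made up for this illustration.

```python
import csv

def split_log_line(line):
    """Split one space-delimited line, keeping double-quoted text as one field."""
    # delimiter and escapechar mirror the "delimiter" and "escape" settings
    # from the format configuration; quotechar is the double quote.
    reader = csv.reader([line], delimiter=" ", quotechar='"', escapechar="\\")
    return next(reader)

line = 'XXX.XXX.XXX.XXX 200 "GET / ... etc" "USER AGENT"'
fields = split_log_line(line)
print(len(fields), fields)
```

A quote-aware reader should return the four fields listed above, with the spaces inside the quoted request and user agent left intact.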

Re: CSV Bug

Posted by Hsuan Yi Chu <hy...@maprtech.com>.
Hi Mathieu,
That issue has been resolved in the Drill 1.2 snapshot
(JIRA issue: https://issues.apache.org/jira/browse/DRILL-3718).

If you would like to try it out, you can download the source code from
GitHub and build it. Or you could wait for the next official release :)

On Thu, Sep 17, 2015 at 7:16 AM, Jim Scott <js...@maprtech.com> wrote:

> While I am not going to tackle your specific question regarding using the
> delimited file reader, I will say that the 1.2 build of Drill has support
> for Apache HTTPd log format parsing. You only have to supply the format
> pattern that was used to create the logs and it will parse the records
> properly.

Re: CSV Bug

Posted by Jim Scott <js...@maprtech.com>.
While I am not going to tackle your specific question regarding using the
delimited file reader, I will say that the 1.2 build of Drill has support
for Apache HTTPd log format parsing. You only have to supply the format
pattern that was used to create the logs and it will parse the records
properly.
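
To illustrate the general idea of letting the format pattern drive the parsing, here is a minimal Python sketch. It is not Drill's implementation and does not use any Drill API. The DIRECTIVES table, the function names and the choice of four directives are assumptions made for this example, and the LogFormat string is only a guess at a pattern that would produce the sample line from the original question.

```python
import re

# Hypothetical mapping from a few Apache LogFormat directives to
# (field name, regex). Only these four directives are handled here.
DIRECTIVES = {
    "%h": ("remote_host", r"\S+"),
    "%>s": ("status", r"\d{3}"),
    "%r": ("request", r'[^"]*'),
    "%{User-Agent}i": ("user_agent", r'[^"]*'),
}

def compile_log_format(log_format):
    """Turn a LogFormat string into a compiled regex with named groups."""
    # Try longer directives first so a short one never shadows a long one.
    ordered = sorted(DIRECTIVES, key=len, reverse=True)
    token_re = re.compile("|".join(re.escape(d) for d in ordered))
    parts = []
    pos = 0
    for m in token_re.finditer(log_format):
        # Text between directives (spaces, quotes) must match literally.
        parts.append(re.escape(log_format[pos:m.start()]))
        name, pattern = DIRECTIVES[m.group()]
        parts.append(f"(?P<{name}>{pattern})")
        pos = m.end()
    parts.append(re.escape(log_format[pos:]))
    return re.compile("^" + "".join(parts) + "$")

def parse_line(log_format, line):
    """Return a dict of named fields, or None if the line does not match."""
    match = compile_log_format(log_format).match(line)
    return match.groupdict() if match else None

# Assumed pattern for the sample line in the original question.
LOG_FORMAT = '%h %>s "%r" "%{User-Agent}i"'
record = parse_line(LOG_FORMAT, 'XXX.XXX.XXX.XXX 200 "GET / ... etc" "USER AGENT"')
print(record)
```

Because the field boundaries come from the pattern rather than from a single delimiter character, spaces inside the quoted request and user agent do not split the record.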



