Posted to user@drill.apache.org by Nicolas Paris <ni...@gmail.com> on 2016/01/31 21:14:15 UTC

DRILL 1.4 - newline in strings not supported

Hello,

I am trying to import a CSV file containing large texts. They contain the newline
character "\n".
Apache Drill complains about that. There is an open JIRA issue at
https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E&usg=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q&sig2=4EM_xXq2QWd8kmC3LT2-Wg

Is there a workaround (other than removing "\n" from the texts)?

Thanks in advance
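
For readers following along, the situation described above can be reproduced outside Drill. This is only a minimal sketch in Python, assuming the text fields are double-quoted in the usual CSV style; the sample data and column names are made up for illustration. It shows that the number of physical lines and the number of logical records differ as soon as a quoted field contains "\n", which is what trips up readers that treat every line break as a record boundary.

```python
import csv
import io

# Hypothetical sample: a header plus two records, one of which has an
# embedded newline inside a quoted field.
raw = 'id,text\n1,"first line\nsecond line"\n2,"single line"\n'

# Counting physical lines treats every "\n" as a record boundary.
physical_lines = raw.splitlines()

# A record-aware parser keeps the quoted newline inside the field.
records = list(csv.reader(io.StringIO(raw, newline="")))

print(len(physical_lines))  # 4
print(len(records))         # 3 (header + 2 records)
print(repr(records[1][1]))  # 'first line\nsecond line'
```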

Re: DRILL 1.4 - newline in strings not supported

Posted by Ted Dunning <te...@gmail.com>.
If you have newlines inside field values, the files become unsuitable for splitting. This means that the only parallelism available in a CTAS statement comes from having multiple files.

Do you have a fair number of files?
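
If there are only a few large files, one way to get more of them is to split the data on record boundaries before handing it to Drill. The following is a rough sketch in Python, not something proposed in this thread: the function name `split_records` is hypothetical, it assumes double-quoted fields and no header row, and it works on in-memory strings for brevity (in practice each chunk would be written to its own file).

```python
import csv
import io
from typing import List


def split_records(csv_text: str, parts: int) -> List[str]:
    """Split CSV text into `parts` chunks without cutting a record in half."""
    # Parse with a record-aware reader so quoted newlines stay inside fields.
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    chunks = []
    for i in range(parts):
        buf = io.StringIO(newline="")
        writer = csv.writer(buf, lineterminator="\n")
        # Distribute whole records round-robin across the chunks.
        writer.writerows(rows[i::parts])
        chunks.append(buf.getvalue())
    return chunks


# Hypothetical sample with embedded newlines in two of three records.
sample = '1,"first line\nsecond line"\n2,"single line"\n3,"a\nb\nc"\n'
for chunk in split_records(sample, 2):
    print(repr(chunk))
```

Each chunk contains only complete records, so the chunks can be processed independently even though none of them can safely be split further at an arbitrary line break.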



Re: DRILL 1.4 - newline in strings not supported

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
Another user has already reported problems querying CSV files that contain
newline characters:

http://comments.gmane.org/gmane.comp.apache.incubator.drill.user/2350

His particular problem was related to a bug in the LIKE function.
Unfortunately, he never got around to filing a JIRA for his issue.

Is your problem similar? If so, could you please file a JIRA?
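
To illustrate how a LIKE implementation can go wrong on multi-line text, here is a small Python sketch. It is an assumption made for illustration only and not a description of Drill's actual code: it translates a LIKE pattern into a regular expression (the helper names `like_to_regex` and `like` are hypothetical) and shows that the match fails when the regex wildcard is not allowed to cross line breaks.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into a regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any sequence of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def like(text: str, pattern: str, dotall: bool = True) -> bool:
    """Return True if the whole text matches the LIKE pattern."""
    # Without DOTALL, "." does not match "\n", so "%" cannot span lines.
    flags = re.DOTALL if dotall else 0
    return re.fullmatch(like_to_regex(pattern), text, flags) is not None


text = "first line\nsecond line with foo\nthird line"
print(like(text, "%foo%", dotall=False))  # False
print(like(text, "%foo%", dotall=True))   # True
```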




-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>



Re: DRILL 1.4 - newline in strings not supported

Posted by Nicolas Paris <ni...@gmail.com>.
Hello Abdel,

I am creating Parquet files from those CSV files (CREATE TABLE syntax).
Basically, I have a text column with a maximum of 50k characters,
containing newlines (the texts are extracted from PDFs). I have
several million tuples of text. I am selecting the texts that contain certain patterns
(LIKE '%foo%' or a regex; sadly, I haven't found any mention of regex in the
documentation, i.e. an equivalent of the PostgreSQL "~" operator).
I usually use PostgreSQL or MonetDB to mine the texts, but I am
benchmarking/studying Apache Drill too.

Thanks,
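
One possible workaround that keeps the newlines, not suggested by anyone in this thread and untested against Drill 1.4, is to convert the CSV into newline-delimited JSON before loading it. In JSON, a line break inside a string is written as the two-character escape \n, so every record ends up on exactly one physical line. A minimal Python sketch, assuming a header row and double-quoted fields; the function name `csv_to_json_lines` and the sample columns are hypothetical:

```python
import csv
import io
import json


def csv_to_json_lines(csv_text: str) -> str:
    """Convert CSV text (with a header row) to one JSON object per line."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    # json.dumps escapes embedded newlines, so each record stays on one line.
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in reader)


# Hypothetical sample with an embedded newline in the first record.
sample = 'id,text\n1,"first line\nsecond line"\n2,"single line"\n'
converted = csv_to_json_lines(sample)
print(converted, end="")
```

Parsing a converted line back with a JSON reader restores the original text, including its line breaks.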



Re: DRILL 1.4 - newline in strings not supported

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
Hey Nicolas,

What kind of queries are you running on your CSV file?



