Posted to user@tika.apache.org by lewis john mcgibbney <le...@apache.org> on 2015/06/19 15:58:49 UTC

CSV Parser in Tika

Hi Folks,
Am I correct in saying that we can't detect CSV in Tika?
We import commons-csv in tika-parsers/pom.xml, however I don't see a csv
package and registered parser.
Also, when I use the webapp I get the following for a test CSV file with
semicolon ';' separators:

Content-Encoding: ISO-8859-1
Content-Length: 217
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
resourceName: test-semicolon.csv

Any comments please?
Thanks
Lewis

Re: CSV Parser in Tika

Posted by Chris Mattmann <ch...@gmail.com>.
Hey Tim, and Lewis,

My students and I did a Tika TSVParser and a JSONContentHandler
in my course a few semesters ago. I am going to whip it up
and contribute it back.

Cheers,
Chris

—
Chris Mattmann
chris.mattmann@gmail.com

-----Original Message-----
From: "Allison, Timothy B." <ta...@mitre.org>
Reply-To: <us...@tika.apache.org>
Date: Friday, June 19, 2015 at 7:27 AM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: RE: CSV Parser in Tika

>Y, that’s my belief.
> 
>As of now, we’re treating them as text files, which can lead to some
>really long = bogus tokens in Lucene/Solr with analyzers that don’t split
>on commas. ☹
> 
>Detection without filename would be difficult.



RE: CSV Parser in Tika

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Yes, that’s my belief.

As of now, we’re treating them as text files, which can lead to some really long (i.e., bogus) tokens in Lucene/Solr with analyzers that don’t split on commas. ☹
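
To make the token problem concrete, here is a rough sketch using only the JDK. It is not Lucene or Solr code: the class and method names (CsvTokenDemo, whitespaceTokens, delimiterAwareTokens) are made up for illustration, and plain whitespace splitting only stands in for an analyzer that doesn't split on commas.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; a stand-in for analyzer behaviour, not a Lucene API.
final class CsvTokenDemo {

    // Splits on whitespace only, so a whole CSV row survives as one token.
    static List<String> whitespaceTokens(String text) {
        return splitNonEmpty(text, "\\s+");
    }

    // Splits on commas, semicolons and whitespace, so each field is a token.
    static List<String> delimiterAwareTokens(String text) {
        return splitNonEmpty(text, "[,;\\s]+");
    }

    private static List<String> splitNonEmpty(String text, String regex) {
        List<String> out = new ArrayList<>();
        for (String t : text.split(regex)) {
            if (!t.isEmpty()) {
                out.add(t); // skip the empty leading token split() can produce
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String rows = "1001,alice@example.com,Pasadena\n1002,bob@example.com,Seattle";
        System.out.println("whitespace only:  " + whitespaceTokens(rows));
        System.out.println("delimiter aware:  " + delimiterAwareTokens(rows));
    }
}
```

With whitespace-only splitting, each row comes out as a single long token, which is the kind of token that ends up in the index when CSV is handled as plain text.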

Detection without filename would be difficult.
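
As a sketch of why content-only detection is hard, here is a naive delimiter sniffer written against the JDK alone. It is not how Tika detects anything; DelimiterSniffer, guessDelimiter and the candidate list are assumptions for illustration. The heuristic accepts a delimiter if it appears the same non-zero number of times on every non-blank line, and it ignores quoting entirely.

```java
import java.util.Optional;

// Hypothetical heuristic for illustration only; not part of Tika.
final class DelimiterSniffer {

    // Assumed candidate delimiters, tried in this order.
    private static final char[] CANDIDATES = {',', ';', '\t', '|'};

    // Returns the first candidate that occurs the same non-zero number of
    // times on every non-blank line, given at least two such lines.
    static Optional<Character> guessDelimiter(String text) {
        String[] lines = text.split("\\R");
        for (char c : CANDIDATES) {
            int expected = -1;
            int rows = 0;
            boolean consistent = true;
            for (String line : lines) {
                if (line.trim().isEmpty()) {
                    continue; // blank lines carry no evidence
                }
                int n = count(line, c);
                if (expected < 0) {
                    expected = n;
                } else if (n != expected) {
                    consistent = false;
                    break;
                }
                rows++;
            }
            if (consistent && expected > 0 && rows >= 2) {
                return Optional.of(c);
            }
        }
        return Optional.empty();
    }

    private static int count(String line, char c) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(guessDelimiter("a;b;c\n1;2;3\n"));
        System.out.println(guessDelimiter("Hello, world\nGoodbye, world"));
    }
}
```

The second call in main is the awkward case: two lines of ordinary prose with one comma each pass the same test as real CSV, so a heuristic like this needs the filename or some other hint to avoid false positives.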




