Posted to user@tika.apache.org by lewis john mcgibbney <le...@apache.org> on 2015/06/19 15:58:49 UTC
CSV Parser in Tika
Hi Folks,
Am I correct in saying that we can't detect CSV in Tika?
We import commons-csv in tika-parsers/pom.xml, but I don't see a csv
package or a registered parser.
Also, when I use the webapp I get the following for a test CSV file with
semicolon ';' separators:
Content-Encoding: ISO-8859-1
Content-Length: 217
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
resourceName: test-semicolon.csv
Any comments please?
Thanks
Lewis
Re: CSV Parser in Tika
Posted by Chris Mattmann <ch...@gmail.com>.
Hey Tim, and Lewis,
My students and I did a Tika TSVParser and a JSONContentHandler
in my course a few semesters ago. I am going to whip it up
and contribute it back.
Cheers,
Chris
—
Chris Mattmann
chris.mattmann@gmail.com
-----Original Message-----
From: "Allison, Timothy B." <ta...@mitre.org>
Reply-To: <us...@tika.apache.org>
Date: Friday, June 19, 2015 at 7:27 AM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: RE: CSV Parser in Tika
>Y, that’s my belief.
>
>As of now, we’re treating them as text files, which can lead to some
>really long = bogus tokens in Lucene/Solr with analyzers that don’t split
>on commas. ☹
>
>Detection without filename would be difficult.
RE: CSV Parser in Tika
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Yes, that’s my belief.
As of now, we’re treating them as text files, which can lead to some really long (and therefore bogus) tokens in Lucene/Solr with analyzers that don’t split on commas. ☹
Detection without filename would be difficult.
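To illustrate why content-only detection is hard, here is a minimal sketch of a delimiter-sniffing heuristic using only the JDK. It is not Tika code and not an existing Tika or commons-csv API: the class name DelimiterSniffer, the method sniff, and the candidate delimiter list are all assumptions made for this example. The idea is that a delimited file tends to show the same non-zero number of a given separator on every line, while ordinary prose does not.

```java
final class DelimiterSniffer {
    // Candidate separators; this list is an assumption for the sketch.
    private static final char[] CANDIDATES = {',', ';', '\t', '|'};

    // Counts occurrences of c in line, ignoring anything inside double quotes.
    private static int count(String line, char c) {
        boolean inQuotes = false;
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '"') {
                inQuotes = !inQuotes;
            } else if (ch == c && !inQuotes) {
                n++;
            }
        }
        return n;
    }

    // Returns the candidate that occurs the same non-zero number of times on
    // every non-empty line (at least two lines required), preferring the one
    // that yields the most columns. Returns '\0' when nothing looks delimited.
    static char sniff(String text) {
        String[] lines = text.split("\r?\n");
        char best = '\0';
        int bestCount = 0;
        for (char c : CANDIDATES) {
            int expected = -1;
            int seen = 0;
            boolean consistent = true;
            for (String line : lines) {
                if (line.isEmpty()) {
                    continue;
                }
                int n = count(line, c);
                if (expected < 0) {
                    expected = n;
                } else if (n != expected) {
                    consistent = false;
                    break;
                }
                seen++;
            }
            if (consistent && seen >= 2 && expected > bestCount) {
                best = c;
                bestCount = expected;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String sample = "name;age;city\nAnn;30;Oslo\nBob;25;Rome\n";
        char d = sniff(sample);
        System.out.println(d == '\0' ? "no delimiter found" : "delimiter: '" + d + "'");
    }
}
```

A heuristic like this is easy to fool: short files, ragged rows, multi-line quoted fields, or prose with exactly one comma per line all produce wrong answers, which is consistent with the point above that detection without a filename would be difficult.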