Posted to dev@oodt.apache.org by Chris Mattmann <ma...@apache.org> on 2013/10/09 16:59:42 UTC

Re: Problems with regex in mimetypes.xml for cas-crawler.

Hey Konstantinos,

Thanks for the email.

You can use PushPull's regex URL validator to test your regex.

From the bin directory of your PushPull deployment, run this as a single
command:

java -Djava.ext.dirs=../lib org.apache.oodt.cas.pushpull.util.ExpressionValidator

A GUI will pop up where you can test the expression. If it matches there
but still doesn't work in the Tika file below, let me know and we'll debug
it in Tika.
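
As a quick command-line cross-check, here is a minimal sketch using plain
java.util.regex. The class name RegexCheck and the sample file names are
made up for illustration, and it assumes the pattern has to match the whole
file name, which may not be how Tika applies isregex globs.

```java
import java.util.regex.Pattern;

public class RegexCheck {
    // Pattern copied from the isregex glob in the quoted mimetypes.xml,
    // with the backslash doubled for a Java string literal.
    static final String FIRST_MATE = ".*_R1\\.fastq.gz$";

    // Returns true when the regex matches the entire file name.
    // Whole-name matching is an assumption about how the glob is applied.
    static boolean matchesWhole(String regex, String fileName) {
        return Pattern.compile(regex).matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample names; substitute real files from $FILEPATH.
        String[] names = {"sample_R1.fastq.gz", "sample_R2.fastq.gz"};
        for (String name : names) {
            System.out.println(name + " -> " + matchesWhole(FIRST_MATE, name));
        }
    }
}
```

If the first name matches here, the expression itself is probably fine and
the problem is more likely in how the mimetypes.xml is being read. Note that
the dot before "gz" is unescaped, so it matches any character; "\.gz" would
be stricter.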

Cheers,
Chris

-----Original Message-----
From: Konstantinos Mavromatis <ma...@gmail.com>
Date: Tuesday, October 8, 2013 11:43 PM
To: "user@oodt.apache.org" <us...@oodt.apache.org>, jpluser
<ch...@jpl.nasa.gov>
Subject: Problems with regex in mimetypes.xml for cas-crawler.

>Hi,
>I have setup the crawler and I am trying to detect the type of the files
>that are ingested based on their filename.
>I have managed to ingest files when I define the filename pattern in
>the mimetypes.xml, but I have had no luck when I try to use regular
>expressions.
>
>
>The following mimetypes.xml works and ingests the files into the database
>properly:
>
>
><?xml version="1.0" encoding="UTF-8"?>
>
><mime-info>
>        <mime-type type="text/fastqFirstMateRead">
>                <glob pattern="*_R1.fastq.gz" />
>                <sub-class-of type="text/fastq"/>
>        </mime-type>
>        <mime-type type="text/fastq">
>                <glob pattern="*.fastq.gz"/>
>        </mime-type>
></mime-info>
>
>
>
>
>
>while the following does not:
><?xml version="1.0" encoding="UTF-8"?>
>
><mime-info>
>        <mime-type type="text/fastqFirstMateRead">
>                <glob pattern=".*_R1\.fastq.gz$" isregex="true" />
>                <sub-class-of type="text/fastq"/>
>        </mime-type>
>        <mime-type type="text/fastq">
>                <glob pattern="*.fastq.gz"/>
>        </mime-type>
></mime-info>
>
>
>
>
>
>The command I am using to ingest the files is:
>./crawler_launcher --operation --launchAutoCrawler --productPath
>$FILEPATH --filemgrUrl $FLMGR_URL --clientTransferer
>org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
>--mimeExtractorRepo ../policy/mime-extractor-map.xml
>
>
>
>Any idea what I am doing wrong?
>Thanks
>K
>