Posted to dev@oodt.apache.org by Konstantinos Mavrommatis <km...@celgene.com> on 2014/10/08 08:31:48 UTC

How to ingest files when metadata contain non standard characters?

Hi,

I am trying to ingest a large number of files. The metadata for these files exist in .met files.

Many of the metadata fields contain characters like '<>&$' etc.

Running the crawler on these metadata fails.

When I try to escape the characters with HTML encoding (e.g. '>' becomes '&gt;'), I still get errors and the crawler cannot ingest the files.



Here is an example of an offending line in the .met file, before and after HTML encoding:

<val>sailfish quant --index /reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex --libtype 'T=PE:O=><:S=AS' -1 <(gunzip -c /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R1.fastq.gz) -2 <(gunzip -c /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R2.fastq.gz) -o /gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HP1_3.Sailfish.txt -p 8  --no_bias_correct  </val>





<val>sailfish quant --index /reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex --libtype &#39;T=PE:O=&gt;&lt;:S=AS&#39; -1 &lt;(gunzip -c /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R1.fastq.gz) -2 &lt;(gunzip -c /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R2.fastq.gz) -o /gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HP1_3.Sailfish.txt -p 8  --no_bias_correct  </val>



If I remove the offending characters (in this case '<>'), the ingestion goes through without any issues.
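As a quick pre-flight check, one can test whether a .met file parses as XML before handing it to the crawler. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed and the shortened sample strings are illustrative assumptions, not part of OODT. Passing this check does not guarantee ingestion: the HTML-encoded line above is well-formed XML and was still rejected.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Shortened versions of the two <val> lines shown above (hypothetical paths).
raw = "<val>sailfish quant --libtype 'T=PE:O=><:S=AS' -1 <(gunzip -c R1.fastq.gz)</val>"
encoded = ("<val>sailfish quant --libtype &#39;T=PE:O=&gt;&lt;:S=AS&#39; "
           "-1 &lt;(gunzip -c R1.fastq.gz)</val>")

print(is_well_formed(raw))      # bare '<' characters inside the value break parsing
print(is_well_formed(encoded))  # entity-encoded text parses
```

A check like this only separates "not XML at all" from "valid XML that the file manager still refuses", which helps narrow down where the failure occurs.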



The crawler command is:

./crawler_launcher --operation --launchAutoCrawler --productPath $FILEPATH --filemgrUrl $OODT_FILEMGR_URL --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory --mimeExtractorRepo ../policy/mime-extractor-map.xml --noRecur --crawlForDirs



The error message I get when I run the crawler is:
INFO: StdIngester: ingesting product: ProductName: [A1_1.Sailfish.sfish]: ProductType: [GenericFile]: FileLocation: [/datavault/RNA-Seq/Processed/Sailfish-transcriptCounts/]

org.apache.xmlrpc.XmlRpcException: java.lang.Exception: org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP method failed: HTTP/1.1 400 Bad Request

      at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)

      at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)

      at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)

      at org.apache.xmlrpc.XmlRpcClient.execute(XmlRpcClient.java:194)

      at org.apache.xmlrpc.XmlRpcClient.execute(XmlRpcClient.java:185)

      at org.apache.xmlrpc.XmlRpcClient.execute(XmlRpcClient.java:178)

      at org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient.ingestProduct(XmlRpcFileManagerClient.java:1178)

      at org.apache.oodt.cas.filemgr.ingest.StdIngester.ingest(StdIngester.java:199)

      at org.apache.oodt.cas.crawl.ProductCrawler.ingest(ProductCrawler.java:304)

      at org.apache.oodt.cas.crawl.ProductCrawler.handleFile(ProductCrawler.java:188)

      at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:108)

      at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:75)

      at org.apache.oodt.cas.crawl.cli.action.CrawlerLauncherCliAction.execute(CrawlerLauncherCliAction.java:58)

      at org.apache.oodt.cas.cli.CmdLineUtility.execute(CmdLineUtility.java:331)

      at org.apache.oodt.cas.cli.CmdLineUtility.run(CmdLineUtility.java:187)

      at org.apache.oodt.cas.crawl.CrawlerLauncher.main(CrawlerLauncher.java:36)

Oct 07, 2014 11:17:18 PM org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient ingestProduct

SEVERE: Failed to ingest product [org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : java.lang.Exception: org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP method failed: HTTP/1.1 400 Bad Request -- rolling back ingest

java.lang.Exception: Failed to ingest product [org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : java.lang.Exception: org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP method failed: HTTP/1.1 400 Bad Request

      at org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient.ingestProduct(XmlRpcFileManagerClient.java:1279)

      at org.apache.oodt.cas.filemgr.ingest.StdIngester.ingest(StdIngester.java:199)

      at org.apache.oodt.cas.crawl.ProductCrawler.ingest(ProductCrawler.java:304)

      at org.apache.oodt.cas.crawl.ProductCrawler.handleFile(ProductCrawler.java:188)

      at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:108)

      at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:75)

      at org.apache.oodt.cas.crawl.cli.action.CrawlerLauncherCliAction.execute(CrawlerLauncherCliAction.java:58)

      at org.apache.oodt.cas.cli.CmdLineUtility.execute(CmdLineUtility.java:331)

      at org.apache.oodt.cas.cli.CmdLineUtility.run(CmdLineUtility.java:187)

      at org.apache.oodt.cas.crawl.CrawlerLauncher.main(CrawlerLauncher.java:36)

Oct 07, 2014 11:17:18 PM org.apache.oodt.cas.filemgr.ingest.StdIngester ingest

WARNING: exception ingesting product: [A1_1.Sailfish.sfish]: Message: Failed to ingest product [org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : java.lang.Exception: org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP method failed: HTTP/1.1 400 Bad Request

Oct 07, 2014 11:17:18 PM org.apache.oodt.cas.crawl.ProductCrawler ingest

WARNING: ProductCrawler: Exception ingesting product: [/datavault/RNA-Seq/Processed/Sailfish-transcriptCounts/A1_1.Sailfish.sfish]: Message: exception ingesting product: [A1_1.Sailfish.sfish]: Message: Failed to ingest product [org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : java.lang.Exception: org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP method failed: HTTP/1.1 400 Bad Request: attempting to continue crawling

org.apache.oodt.cas.filemgr.structs.exceptions.IngestException: exception ingesting product: [A1_1.Sailfish.sfish]: Message: Failed to ingest product [org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : java.lang.Exception: org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP method failed: HTTP/1.1 400 Bad Request

      at org.apache.oodt.cas.filemgr.ingest.StdIngester.ingest(StdIngester.java:204)

      at org.apache.oodt.cas.crawl.ProductCrawler.ingest(ProductCrawler.java:304)

      at org.apache.oodt.cas.crawl.ProductCrawler.handleFile(ProductCrawler.java:188)

      at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:108)

      at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:75)

      at org.apache.oodt.cas.crawl.cli.action.CrawlerLauncherCliAction.execute(CrawlerLauncherCliAction.java:58)

      at org.apache.oodt.cas.cli.CmdLineUtility.execute(CmdLineUtility.java:331)

      at org.apache.oodt.cas.cli.CmdLineUtility.run(CmdLineUtility.java:187)

      at org.apache.oodt.cas.crawl.CrawlerLauncher.main(CrawlerLauncher.java:36)



Oct 07, 2014 11:17:18 PM org.apache.oodt.cas.crawl.ProductCrawler handleFile

WARNING: Failed to ingest product: [/datavault/RNA-Seq/Processed/Sailfish-transcriptCounts/A1_1.Sailfish.sfish]: performing postIngestFail actions



Any ideas how I can ingest these files?

Thanks
K


*********************************************************
THIS ELECTRONIC MAIL MESSAGE AND ANY ATTACHMENT IS
CONFIDENTIAL AND MAY CONTAIN LEGALLY PRIVILEGED
INFORMATION INTENDED ONLY FOR THE USE OF THE INDIVIDUAL
OR INDIVIDUALS NAMED ABOVE.
If the reader is not the intended recipient, or the
employee or agent responsible to deliver it to the
intended recipient, you are hereby notified that any
dissemination, distribution or copying of this
communication is strictly prohibited. If you have
received this communication in error, please reply to the
sender to notify us of the error and delete the original
message. Thank You.

Re: How to ingest files when metadata contain non standard characters?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Thanks Kos,
Please see
https://issues.apache.org/jira/browse/OODT-759
We will track it there from now on and determine what needs to be done.

On Wed, Oct 8, 2014 at 8:55 PM, Konstantinos Mavrommatis <
kmavrommatis@celgene.com> wrote:

> Here is the offending file before escape:
>
>
>
> <cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
>         <keyval>
>                 <key>derived_from</key>
>
> <val>/gpfs/celgene/reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex</val>
>
> <val>/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HM1_1_R1.fastq.gz</val>
>
> <val>/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HM1_1_R2.fastq.gz</val>
>         </keyval>
>         <keyval>
>                 <key>FilePath</key>
>
> <val>/gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HM1_1.Sailfish.sfish</val>
>         </keyval>
>         <keyval>
>                 <key>start_execution</key>
>                 <val>Tue Oct  7 20:49:12 2014</val>
>         </keyval>
>         <keyval>
>                 <key>ingest_user</key>
>                 <val>kmavrommatis</val>
>         </keyval>
>         <keyval>
>                 <key>end_execution</key>
>                 <val>Tue Oct  7 21:03:47 2014</val>
>         </keyval>
>         <keyval>
>                 <key>run_user</key>
>                 <val>kmavrommatis</val>
>         </keyval>
>         <keyval>
>                 <key>file_host</key>
>                 <val>ussdgsphpccas02</val>
>         </keyval>
>         <keyval>
>                 <key>generator</key>
>                 <val>sailfish</val>
>         </keyval>
>         <keyval>
>                 <key>run_host</key>
>                 <val>ussdgsphpccmp01</val>
>         </keyval>
>         <keyval>
>                 <key>sample_id</key>
>                 <val>2569</val>
>         </keyval>
>         <keyval>
>                 <key>generator_version</key>
>                 <val>sailfish[0.6.3]</val>
>         </keyval>
>         <keyval>
>                 <key>ProductType</key>
>                 <val>GenericFile</val>
>         </keyval>
>         <keyval>
>                 <key>analysis_task</key>
>                 <val>38</val>
>         </keyval>
>         <keyval>
>                 <key>generator_string</key>
>                 <val>"sailfish quant --index
> /gpfs/celgene/reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex --libtype
> 'T=PE:O=><:S=AS' -1 <(gunzip -c
> /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HM1_1_R1.fastq.gz)
> -2 <(gunzip -c
> /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HM1_1_R2.fastq.gz)
> -o
> /gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HM1_1.Sailfish.txt
> -p 8  --no_bias_correct "</val>
>         </keyval>
> </cas:metadata>
>
>



-- 
*Lewis*

RE: How to ingest files when metadata contain non standard characters?

Posted by Konstantinos Mavrommatis <km...@celgene.com>.
Here is the offending file before escape:



<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
	<keyval>
		<key>derived_from</key>
		<val>/gpfs/celgene/reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex</val>
		<val>/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HM1_1_R1.fastq.gz</val>
		<val>/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HM1_1_R2.fastq.gz</val>
	</keyval>
	<keyval>
		<key>FilePath</key>
		<val>/gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HM1_1.Sailfish.sfish</val>
	</keyval>
	<keyval>
		<key>start_execution</key>
		<val>Tue Oct  7 20:49:12 2014</val>
	</keyval>
	<keyval>
		<key>ingest_user</key>
		<val>kmavrommatis</val>
	</keyval>
	<keyval>
		<key>end_execution</key>
		<val>Tue Oct  7 21:03:47 2014</val>
	</keyval>
	<keyval>
		<key>run_user</key>
		<val>kmavrommatis</val>
	</keyval>
	<keyval>
		<key>file_host</key>
		<val>ussdgsphpccas02</val>
	</keyval>
	<keyval>
		<key>generator</key>
		<val>sailfish</val>
	</keyval>
	<keyval>
		<key>run_host</key>
		<val>ussdgsphpccmp01</val>
	</keyval>
	<keyval>
		<key>sample_id</key>
		<val>2569</val>
	</keyval>
	<keyval>
		<key>generator_version</key>
		<val>sailfish[0.6.3]</val>
	</keyval>
	<keyval>
		<key>ProductType</key>
		<val>GenericFile</val>
	</keyval>
	<keyval>
		<key>analysis_task</key>
		<val>38</val>
	</keyval>
	<keyval>
		<key>generator_string</key>
		<val>"sailfish quant --index /gpfs/celgene/reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex --libtype 'T=PE:O=><:S=AS' -1 <(gunzip -c /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HM1_1_R1.fastq.gz) -2 <(gunzip -c /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HM1_1_R2.fastq.gz) -o /gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HM1_1.Sailfish.txt -p 8  --no_bias_correct "</val>
	</keyval>
</cas:metadata>


Re: How to ingest files when metadata contain non standard characters?

Posted by Chris Mattmann <ch...@gmail.com>.
Thanks Kostas. Can you upload it somewhere and then point to it here? The
mailing list strips attachments.

Cheers,
Chris

------------------------
Chris Mattmann
chris.mattmann@gmail.com




-----Original Message-----
From: Konstantinos Mavrommatis <km...@celgene.com>
Reply-To: <de...@oodt.apache.org>
Date: Thursday, October 9, 2014 at 5:48 AM
To: "dev@oodt.apache.org" <de...@oodt.apache.org>
Subject: RE: How to ingest files when metadata contain non standard
characters?

>Thanks Chris,
>
>attached is an offending file before escape.
>For the record, the Perl module HTML::Entities provides an alternative to
>CGI::escapeHTML that produces acceptable files.
>
>Thanks
>K
>
>
>> -----Original Message-----
>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>> Sent: Wednesday, October 08, 2014 11:38 AM
>> To: dev@oodt.apache.org
>> Subject: Re: How to ingest files when metadata contain non standard
>> characters?
>> 
>> cas-metadata should handle this escaping/unescaping in its SerDe
>> capabilities.
>> 
>> Kostas, can you provide the exact file that I can test on and upload it
>> to JIRA?
>> 
>> ------------------------
>> Chris Mattmann
>> chris.mattmann@gmail.com
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Lewis John Mcgibbney <le...@gmail.com>
>> Reply-To: <de...@oodt.apache.org>
>> Date: Thursday, October 9, 2014 at 2:59 AM
>> To: "dev@oodt.apache.org" <de...@oodt.apache.org>
>> Subject: Re: How to ingest files when metadata contain non standard
>> characters?
>> 
>> >Hi Kos,
>> >Thanks for reply
>> >
>> >On Wed, Oct 8, 2014 at 5:16 PM, Konstantinos Mavrommatis <
>> >kmavrommatis@celgene.com> wrote:
>> >
>> >> I escaped the characters using the CGI::escapeHTML function from the
>> >> CGI perl module.
>> >>
>> >
>> >Wow. I am surprised at this one. I wonder if this is a bug which results
>> >in the discrepancy or if this is intentional behaviour!
>> >
>> >
>> >>
>> >> The difference between the two versions (mine escaped vs yours
>> >> escaped) is in the encoding of the single quote "'" character, if I
>> >> am not mistaken.
>> >> I want to clarify this because your email came as simple ASCII (not
>> >> HTML).
>> >>
>> >
>> >Yes that is correct.
>> >
>> >
>> >>
>> >> I did try your command and it worked !!!
>> >>
>> >
>> >OK grand.
>> >
>> >
>> >>
>> >> Now the question is how to do this encoding (your version) ☺
>> >>
>> >>
>> >Is this the question? My thoughts would be that this should be
>> >encapsulated within OODT somewhere and that it should not be necessary
>> >to escape everything as you/we have been doing. This is extremely time
>> >consuming and painful.
>> >
>> >I escaped everything here
>> >http://www.freeformatter.com/html-escape.html
>> >
>> >and compared the strings here
>> >http://text-compare.com/
>> >
>> >The latter resource will verify that it is the single quote that is the
>> >offending char here.
>> >Thanks
>> >Lewis
>> 
>



RE: How to ingest files when metadata contain non standard characters?

Posted by Konstantinos Mavrommatis <km...@celgene.com>.
Thanks Chris,

attached is an offending file before escape.
For the record, the Perl module HTML::Entities provides an alternative to CGI::escapeHTML that produces acceptable files.

Thanks
K


> -----Original Message-----
> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
> Sent: Wednesday, October 08, 2014 11:38 AM
> To: dev@oodt.apache.org
> Subject: Re: How to ingest files when metadata contain non standard
> characters?
> 
> cas-metadata should handle this escaping/unescaping in its SerDe
> capabilities.
> 
> Kostas, can you provide the exact file that I can test on and upload it
> to JIRA?
> 
> ------------------------
> Chris Mattmann
> chris.mattmann@gmail.com
> 
> 
> 
> 
> -----Original Message-----
> From: Lewis John Mcgibbney <le...@gmail.com>
> Reply-To: <de...@oodt.apache.org>
> Date: Thursday, October 9, 2014 at 2:59 AM
> To: "dev@oodt.apache.org" <de...@oodt.apache.org>
> Subject: Re: How to ingest files when metadata contain non standard
> characters?
> 
> >Hi Kos,
> >Thanks for reply
> >
> >On Wed, Oct 8, 2014 at 5:16 PM, Konstantinos Mavrommatis <
> >kmavrommatis@celgene.com> wrote:
> >
> >> I escaped the characters using the CGI::escapeHTML function from the
> >> CGI perl module.
> >>
> >
> >Wow. I am surprised at this one. I wonder if this is a bug which results
> >in the discrepancy or if this is intentional behaviour!
> >
> >
> >>
> >> The difference between the two versions (mine escaped vs yours
> >> escaped) is in the encoding of the single quote "'" character, if I
> >> am not mistaken.
> >> I want to clarify this because your email came as simple ASCII (not
> >> HTML).
> >>
> >
> >Yes that is correct.
> >
> >
> >>
> >> I did try your command and it worked !!!
> >>
> >
> >OK grand.
> >
> >
> >>
> >> Now the question is how to do this encoding (your version) ☺
> >>
> >>
> >Is this the question? My thoughts would be that this should be
> >encapsulated within OODT somewhere and that it should not be necessary
> >to escape everything as you/we have been doing. This is extremely time
> >consuming and painful.
> >
> >I escaped everything here
> >http://www.freeformatter.com/html-escape.html
> >
> >and compared the strings here
> >http://text-compare.com/
> >
> >The latter resource will verify that it is the single quote that is the
> >offending char here.
> >Thanks
> >Lewis
> 



Re: How to ingest files when metadata contain non standard characters?

Posted by Chris Mattmann <ch...@gmail.com>.
cas-metadata should handle this escaping/unescaping in its
SerDe capabilities.
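
To illustrate what serializer-side escaping looks like in general, here is a sketch using Python's standard xml.etree.ElementTree rather than OODT's cas-metadata classes, whose behaviour is exactly what this thread is investigating. The namespace URI and element names are taken from the .met file posted in this thread; the shortened command string is an assumption for brevity.

```python
import xml.etree.ElementTree as ET

CAS_NS = "http://oodt.jpl.nasa.gov/1.0/cas"
ET.register_namespace("cas", CAS_NS)

# Shortened, hypothetical version of the generator_string value.
value = "sailfish quant --libtype 'T=PE:O=><:S=AS' -1 <(gunzip -c R1.fastq.gz)"

# Build <cas:metadata><keyval><key>..</key><val>..</val></keyval></cas:metadata>
root = ET.Element("{%s}metadata" % CAS_NS)
keyval = ET.SubElement(root, "keyval")
ET.SubElement(keyval, "key").text = "generator_string"
ET.SubElement(keyval, "val").text = value  # raw text, no manual escaping

# Serialization escapes &, < and > in element text automatically.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)

# Deserializing gives back the original, unescaped value.
restored = ET.fromstring(serialized).find("keyval/val").text
print(restored == value)
```

When the writer and reader both go through an XML library like this, the caller never has to escape values by hand, which is the behaviour being asked of cas-metadata here.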

Kostas, can you provide the exact file that I can test on and upload it to
JIRA?

------------------------
Chris Mattmann
chris.mattmann@gmail.com




-----Original Message-----
From: Lewis John Mcgibbney <le...@gmail.com>
Reply-To: <de...@oodt.apache.org>
Date: Thursday, October 9, 2014 at 2:59 AM
To: "dev@oodt.apache.org" <de...@oodt.apache.org>
Subject: Re: How to ingest files when metadata contain non standard
characters?

>Hi Kos,
>Thanks for reply
>
>On Wed, Oct 8, 2014 at 5:16 PM, Konstantinos Mavrommatis <
>kmavrommatis@celgene.com> wrote:
>
>> I escaped the characters using the CGI::escapeHTML function from the CGI
>> perl module.
>>
>
>Wow. I am surprised at this one. I wonder if this is a bug which results in
>the discrepancy or if this is intentional behaviour!
>
>
>>
>> The difference between the two versions (mine escaped vs yours escaped)
>> is in the encoding of the single quote "'" character, if I am not
>> mistaken.
>> I want to clarify this because your email came as simple ASCII (not
>> HTML).
>>
>
>Yes that is correct.
>
>
>>
>> I did try your command and it worked !!!
>>
>
>OK grand.
>
>
>>
>> Now the question is how to do this encoding (your version) ☺
>>
>>
>Is this the question? My thoughts would be that this should be
>encapsulated
>within OODT somewhere and that it should not be necessary to escape
>everything as you/we have been doing. This is extremely time consuming and
>painful.
>
>I escaped everything here
>http://www.freeformatter.com/html-escape.html
>
>and compared the strings here
>http://text-compare.com/
>
>The latter resource will verify that it is the single quote that is the
>offending char here.
>Thanks
>Lewis



Re: How to ingest files when metadata contain non standard characters?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
In addition, if you can get to the bottom of what you think the intended
behaviour is here, please feel free to log a ticket in Jira
https://issues.apache.org/jira/browse/OODT/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel

On Wed, Oct 8, 2014 at 5:59 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Kos,
> Thanks for reply
>
> On Wed, Oct 8, 2014 at 5:16 PM, Konstantinos Mavrommatis <
> kmavrommatis@celgene.com> wrote:
>
>> I escaped the characters using the CGI::escapeHTML function from the CGI
>> perl module.
>>
>
> Wow. I am surprised at this one. I wonder if this is a bug which results in
> the discrepancy or if this is intentional behaviour!
>
>
>>
>> The difference between the two versions (mine escaped vs yours escaped)
>> is in the encoding of the single quote "'" character, if I am not mistaken.
>> I want to clarify this because your email came as simple ASCII (not HTML).
>>
>
> Yes that is correct.
>
>
>>
>> I did try your command and it worked !!!
>>
>
> OK grand.
>
>
>>
>> Now the question is how to do this encoding (your version) ☺
>>
>>
> Is this the question? My thoughts would be that this should be
> encapsulated within OODT somewhere and that it should not be necessary to
> escape everything as you/we have been doing. This is extremely time
> consuming and painful.
>
> I escaped everything here
> http://www.freeformatter.com/html-escape.html
>
> and compared the strings here
> http://text-compare.com/
>
> The latter resource will verify that it is the single quote that is the
> offending char here.
> Thanks
> Lewis
>



-- 
*Lewis*

Re: How to ingest files when metadata contain non standard characters?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Kos,
Thanks for reply

On Wed, Oct 8, 2014 at 5:16 PM, Konstantinos Mavrommatis <
kmavrommatis@celgene.com> wrote:

> I escaped the characters using the CGI::escapeHTML function from the CGI
> perl module.
>

Wow. I am surprised at this one. I wonder if this is a bug which results in
the discrepancy or if this is intentional behaviour!


>
> The difference between the two versions (mine escaped vs yours escaped)
> is in the encoding of the single quote "'" character, if I am not mistaken.
> I want to clarify this because your email came as simple ASCII (not HTML).
>

Yes that is correct.


>
> I did try your command and it worked !!!
>

OK grand.


>
> Now the question is how to do this encoding (your version) ☺
>
>
Is this the question? My thoughts would be that this should be encapsulated
within OODT somewhere and that it should not be necessary to escape
everything as you/we have been doing. This is extremely time consuming and
painful.

I escaped everything here
http://www.freeformatter.com/html-escape.html

and compared the strings here
http://text-compare.com/

The latter resource will verify that it is the single quote that is the
offending char here.
Thanks
Lewis

RE: How to ingest files when metadata contain non standard characters?

Posted by Konstantinos Mavrommatis <km...@celgene.com>.
Hi Lewis

I escaped the characters using the CGI::escapeHTML function from the CGI perl module.

The difference between the two versions (mine escaped vs yours escaped) is in the encoding of the single quote "'" character, if I am not mistaken. I want to clarify this because your email came as simple ASCII (not HTML).



I did try your command and it worked !!!

Now the question is how to do this encoding (your version) ☺
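
One possible way to produce that encoding in bulk is sketched below in Python; it is an illustration, not an OODT feature. It rewrites the text between <val> and </val>, escaping only &, < and > with xml.sax.saxutils.escape and leaving single quotes as they are, which matches the version that ingested. The function name escape_val_contents and the regular expression are assumptions; the regex assumes that values never contain the literal strings <val> or </val> and that the file has not already been escaped.

```python
import re
from xml.sax.saxutils import escape

# Non-greedy match of everything between <val> and </val>, across newlines.
VAL_RE = re.compile(r"<val>(.*?)</val>", re.DOTALL)


def escape_val_contents(met_text):
    """Escape &, < and > inside every <val>...</val>; quotes are untouched."""
    return VAL_RE.sub(lambda m: "<val>" + escape(m.group(1)) + "</val>", met_text)


# Shortened, hypothetical version of the offending line.
sample = "<val>sailfish quant --libtype 'T=PE:O=><:S=AS' -1 <(gunzip -c R1.fastq.gz)</val>"
print(escape_val_contents(sample))
```

To apply it to a file, read the .met file as text, pass the contents through escape_val_contents, and write the result back before running the crawler.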

Thanks

K



> -----Original Message-----
> From: Lewis John Mcgibbney [mailto:lewis.mcgibbney@gmail.com]
> Sent: Wednesday, October 08, 2014 1:43 PM
> To: dev@oodt.apache.org
> Subject: Re: How to ingest files when metadata contain non standard
> characters?
>
> Hi Kos,
> I take you up on your challenge ;) However I don't know if this will
> fix it.
>
> On Tue, Oct 7, 2014 at 11:31 PM, Konstantinos Mavrommatis <
> kmavrommatis@celgene.com> wrote:
>
> >
> > <val>sailfish quant --index
> > /reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex --libtype
> > 'T=PE:O=><:S=AS' -1 <(gunzip -c
> > /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R1.fastq.gz)
> > -2 <(gunzip -c
> > /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R2.fastq.gz)
> > -o
> > /gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HP1_3.Sailfish.txt
> > -p 8  --no_bias_correct  </val>
> >
>
> OK, the code above is what you initially pasted...
>
> >
> > <val>sailfish quant --index
> > /reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex --libtype
> > *&#39;T=PE:O=&gt;&lt;:S=AS&#39;* -1 &lt;(gunzip -c
> > /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R1.fastq.gz)
> > -2 &lt;(gunzip -c
> > /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R2.fastq.gz)
> > -o
> > /gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HP1_3.Sailfish.txt
> > -p 8  --no_bias_correct  </val>
> >
>
> The code above is what you pasted once you had escaped everything. Did
> you do this manually? I get a different output which I've pasted below
>
> 1sailfish quant --index /reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex
> --libtype *'T=PE:O=&gt;&lt;:S=AS'* -1 &lt;(gunzip -c
> /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R1.fastq.gz)
> -2 &lt;(gunzip -c
> /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R2.fastq.gz)
> -o
> /gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HP1_3.Sailfish.txt
> -p 8  --no_bias_correct
> Please notice the difference in the part which I have put in bold. Can you
> try reingesting and see if you come up donald trumps?
>
> >
> > org.apache.oodt.cas.filemgr.structs.exceptions.IngestException: exception
> > ingesting product: [A1_1.Sailfish.sfish]: Message: Failed to ingest product
> > [org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] :
> > java.lang.Exception:
> > org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
> > ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87]
> > : HTTP method failed: HTTP/1.1 400 Bad Request
> >
>
> BTW, you also have AGAIN highlighted the horrible opaque Product
> objects we get as Exception output. I logged an issue for this last
> week.
> https://issues.apache.org/jira/browse/OODT-755
> We need to fix this and I will try my damnedest to hack it at the
> weekend.
> Thanks
> Lewis


Re: How to ingest files when metadata contain non standard characters?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Kos,
I take you up on your challenge ;) However I don't know if this will fix it.

On Tue, Oct 7, 2014 at 11:31 PM, Konstantinos Mavrommatis <
kmavrommatis@celgene.com> wrote:

>
> <val>sailfish quant --index
> /reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex --libtype
> 'T=PE:O=><:S=AS' -1 <(gunzip -c
> /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R1.fastq.gz)
> -2 <(gunzip -c
> /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R2.fastq.gz)
> -o
> /gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HP1_3.Sailfish.txt
> -p 8  --no_bias_correct  </val>
>

OK, the code above is what you initially pasted...



>
> <val>sailfish quant --index
> /reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex --libtype
> *&#39;T=PE:O=&gt;&lt;:S=AS&#39;* -1 &lt;(gunzip -c
> /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R1.fastq.gz)
> -2 &lt;(gunzip -c
> /gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R2.fastq.gz)
> -o
> /gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HP1_3.Sailfish.txt
> -p 8  --no_bias_correct  </val>
>


The code above is what you pasted once you had escaped everything. Did you
do this manually? I get a different output which I've pasted below


1sailfish quant --index /reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex
--libtype *'T=PE:O=&gt;&lt;:S=AS'* -1 &lt;(gunzip -c
/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R1.fastq.gz)
-2 &lt;(gunzip -c
/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R2.fastq.gz)
-o
/gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HP1_3.Sailfish.txt
-p 8  --no_bias_correct
Please notice the difference in the part which I have put in bold. Can you try
reingesting and see if you come up donald trumps?
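
The comparison of the two escaped strings can also be reproduced locally rather than through a website. The sketch below, in Python, uses shortened copies of the two strings from this message (an assumption for brevity) and the standard difflib and html modules to list where they differ and to confirm that both decode to the same text.

```python
import difflib
import html

# Shortened copies of the two escaped strings (paths abbreviated).
kos = "--libtype &#39;T=PE:O=&gt;&lt;:S=AS&#39; -1 &lt;(gunzip -c R1.fastq.gz)"
lewis = "--libtype 'T=PE:O=&gt;&lt;:S=AS' -1 &lt;(gunzip -c R1.fastq.gz)"

# Collect every stretch where the two strings differ.
matcher = difflib.SequenceMatcher(None, kos, lewis, autojunk=False)
diffs = [(kos[i1:i2], lewis[j1:j2])
         for tag, i1, i2, j1, j2 in matcher.get_opcodes()
         if tag != "equal"]
print(diffs)

# Both strings decode to the same original text.
print(html.unescape(kos) == html.unescape(lewis))
```

Since both forms decode to the same characters, the difference lies only in how the single quote is written, not in the value itself.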



>
> org.apache.oodt.cas.filemgr.structs.exceptions.IngestException: exception
> ingesting product: [A1_1.Sailfish.sfish]: Message: Failed to ingest product
> [org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] :
> java.lang.Exception:
> org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
> ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87]
> : HTTP method failed: HTTP/1.1 400 Bad Request
>

BTW, you also have AGAIN highlighted the horrible opaque Product objects we
get as Exception output. I logged an issue for this last week.
https://issues.apache.org/jira/browse/OODT-755
We need to fix this and I will try my damnedest to hack it at the weekend.
Thanks
Lewis