You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@couchdb.apache.org by Matthew Woodward <ma...@mattwoodward.com> on 2010/11/17 01:46:33 UTC

Best Way to Handle Inserting HTML Data

I have a SQL Server database I'm converting to CouchDB. I wrote a script
that pulls data form SQL Server, denormalizes into the JSON structure I want
to have in CouchDB, and then makes an HTTP call to insert each record.

The issue I'm running into is some of the fields in SQL Server have HTML in
them, and I'm not sure the best way to handle this. It's easy enough to
brute-force escape all the HTML tags, quotes, etc. but then the data won't
display as HTML when I want to display it on a web page.

The more I thought about it, I think the only character that will really
cause issues with the JSON (other than high-ASCII MS Word junk, which I'm
stripping out/converting) is ". Here's an example:

myJSONString = '{"foo":"bar", "baz":"<a href="http://foo.com">Here's a
hyperlink</a>"}';

So if I don't do any sort of escaping the first " in the href will cause the
JSON to be invalid. But whereas I can change " to &quot; in regular text
(e.g. "Here's a quote" can become &quot;Here's a quote&quot; and still be
fine), that won't really work with hyperlinks since <a href=&quot;
http://foo.com&quot;> won't work properly I don't believe, or at any rate
isn't all that desirable.

If I escape the " with \ (using the example above, this becomes <a href=\"
http://foo.com\">) that lets me get the document into Couch successfully,
but I have a feeling that isn't going to work if I just want to display that
field as HTML on a web page (though I have yet to experiment with that to
see what happens when it's rendered).

Anyway, I've been going around in circles about how best to handle this and
since I'm new to Couch, I figured I'd ask to see what the conventional
wisdom/"best practice" is for something like this. I'm happy to go a
different route completely (write JSON to a file first and load using curl,
whatever works ...) if my current approach isn't the best way of going about
this.

Thanks!

-- 
Matthew Woodward
matt@mattwoodward.com
http://blog.mattwoodward.com
identi.ca / Twitter: @mpwoodward

Please do not send me proprietary file formats such as Word, PowerPoint,
etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html

Re: Best Way to Handle Inserting HTML Data

Posted by Matthew Woodward <ma...@mattwoodward.com>.
On Tue, Nov 16, 2010 at 5:35 PM, Jonathan Stott <jo...@gmail.com>wrote:

> I suggest not doing JSON (de)serialization on your own.
>

Well, huge duh on my part. ;-) Not sure why I was making it so much harder
than it needed to be. I was already taking the JSON and deserializing into
native objects, so I'm not sure why it didn't dawn on my to go the other
direction for my inserts.

Thanks for the shove in the right direction!

Matt

-- 
Matthew Woodward
matt@mattwoodward.com
http://blog.mattwoodward.com
identi.ca / Twitter: @mpwoodward

Please do not send me proprietary file formats such as Word, PowerPoint,
etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html

Re: Best Way to Handle Inserting HTML Data

Posted by Matthew Woodward <ma...@mattwoodward.com>.
On Tue, Nov 16, 2010 at 5:35 PM, Jonathan Stott <jo...@gmail.com>wrote:

> I suggest not doing JSON (de)serialization on your own.
>
> Make your script do all the denormalization, and convert the result to
> a hash/dictionary.  Almost all languages have a JSON encoder that will
> do the correct escaping for you. POST it to couch DB.  On the way out,
> JSON decode the result, which will unescape things appropriately, and
> give you back a dictionary/hash.
>

Thanks! I'll give that approach a shot and see where I get.

-- 
Matthew Woodward
matt@mattwoodward.com
http://blog.mattwoodward.com
identi.ca / Twitter: @mpwoodward

Please do not send me proprietary file formats such as Word, PowerPoint,
etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html

Re: Best Way to Handle Inserting HTML Data

Posted by Jonathan Stott <jo...@gmail.com>.
I suggest not doing JSON (de)serialization on your own.

Make your script do all the denormalization, and convert the result to
a hash/dictionary.  Almost all languages have a JSON encoder that will
do the correct escaping for you. POST it to couch DB.  On the way out,
JSON decode the result, which will unescape things appropriately, and
give you back a dictionary/hash.

Regards
Jon

On 17 November 2010 01:27, Matthew Woodward <ma...@mattwoodward.com> wrote:
> On Tue, Nov 16, 2010 at 5:07 PM, Mike Miller <mi...@cloudant.com> wrote:
>
>> Can you just save the HTML as an attachment?  -Mike
>
>
> Well ... I thought about that too. Here's the issue.
>
> Basically the database is of an event with some high-level details (name of
> event, date/time, location, etc.), and an array of attendees. The attendee
> JSON object has name, title, etc. and then an attendee statement, which is
> where the HTML comes in since these statements can be either hyperlinks or
> full-blown HTML documents right in the database.
>
> Right now I have the event as my document in my CouchDB database, and then I
> have an array of attendees. So if I have 10 attendees each with a statement,
> then if I do attachments I'd have to figure out how to map the statement
> back to the attendee within my attendee array. So I considered doing
> attachments but couldn't figure out how to handle knowing which attachment
> went with which attendee.
> --
> Matthew Woodward
> matt@mattwoodward.com
> http://blog.mattwoodward.com
> identi.ca / Twitter: @mpwoodward
>
> Please do not send me proprietary file formats such as Word, PowerPoint,
> etc. as attachments.
> http://www.gnu.org/philosophy/no-word-attachments.html
>

Re: Best Way to Handle Inserting HTML Data

Posted by Matthew Woodward <ma...@mattwoodward.com>.
On Tue, Nov 16, 2010 at 5:07 PM, Mike Miller <mi...@cloudant.com> wrote:

> Can you just save the HTML as an attachment?  -Mike


Well ... I thought about that too. Here's the issue.

Basically the database is of an event with some high-level details (name of
event, date/time, location, etc.), and an array of attendees. The attendee
JSON object has name, title, etc. and then an attendee statement, which is
where the HTML comes in since these statements can be either hyperlinks or
full-blown HTML documents right in the database.

Right now I have the event as my document in my CouchDB database, and then I
have an array of attendees. So if I have 10 attendees each with a statement,
then if I do attachments I'd have to figure out how to map the statement
back to the attendee within my attendee array. So I considered doing
attachments but couldn't figure out how to handle knowing which attachment
went with which attendee.
-- 
Matthew Woodward
matt@mattwoodward.com
http://blog.mattwoodward.com
identi.ca / Twitter: @mpwoodward

Please do not send me proprietary file formats such as Word, PowerPoint,
etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html

Re: Best Way to Handle Inserting HTML Data

Posted by Mike Miller <mi...@cloudant.com>.
Can you just save the HTML as an attachment?  -Mike

On Nov 16, 2010, at 7:46 PM, Matthew Woodward <ma...@mattwoodward.com>  
wrote:

> I have a SQL Server database I'm converting to CouchDB. I wrote a  
> script
> that pulls data form SQL Server, denormalizes into the JSON  
> structure I want
> to have in CouchDB, and then makes an HTTP call to insert each record.
>
> The issue I'm running into is some of the fields in SQL Server have  
> HTML in
> them, and I'm not sure the best way to handle this. It's easy enough  
> to
> brute-force escape all the HTML tags, quotes, etc. but then the data  
> won't
> display as HTML when I want to display it on a web page.
>
> The more I thought about it, I think the only character that will  
> really
> cause issues with the JSON (other than high-ASCII MS Word junk,  
> which I'm
> stripping out/converting) is ". Here's an example:
>
> myJSONString = '{"foo":"bar", "baz":"<a href="http://foo.com">Here's a
> hyperlink</a>"}';
>
> So if I don't do any sort of escaping the first " in the href will  
> cause the
> JSON to be invalid. But whereas I can change " to &quot; in regular  
> text
> (e.g. "Here's a quote" can become &quot;Here's a quote&quot; and  
> still be
> fine), that won't really work with hyperlinks since <a href=&quot;
> http://foo.com&quot;> won't work properly I don't believe, or at any  
> rate
> isn't all that desirable.
>
> If I escape the " with \ (using the example above, this becomes <a  
> href=\"
> http://foo.com\">) that lets me get the document into Couch  
> successfully,
> but I have a feeling that isn't going to work if I just want to  
> display that
> field as HTML on a web page (though I have yet to experiment with  
> that to
> see what happens when it's rendered).
>
> Anyway, I've been going around in circles about how best to handle  
> this and
> since I'm new to Couch, I figured I'd ask to see what the conventional
> wisdom/"best practice" is for something like this. I'm happy to go a
> different route completely (write JSON to a file first and load  
> using curl,
> whatever works ...) if my current approach isn't the best way of  
> going about
> this.
>
> Thanks!
>
> -- 
> Matthew Woodward
> matt@mattwoodward.com
> http://blog.mattwoodward.com
> identi.ca / Twitter: @mpwoodward
>
> Please do not send me proprietary file formats such as Word,  
> PowerPoint,
> etc. as attachments.
> http://www.gnu.org/philosophy/no-word-attachments.html

Re: Best Way to Handle Inserting HTML Data

Posted by Matthew Woodward <ma...@mattwoodward.com>.
2010/11/16 Rafał Pocztarski <r....@gmail.com>

> It's easy to get it wrong so that's why I second Jonathan's advice to
> use a ready JSON encoder/decoder library and not worry about it
> yourself.
>
> By the way, I'm new on this list so hello to everyone.
>

Thanks Rafał (and hello!)--this approach is working like a charm.

-- 
Matthew Woodward
matt@mattwoodward.com
http://blog.mattwoodward.com
identi.ca / Twitter: @mpwoodward

Please do not send me proprietary file formats such as Word, PowerPoint,
etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html

Re: Best Way to Handle Inserting HTML Data

Posted by Rafał Pocztarski <r....@gmail.com>.
2010/11/17 Matthew Woodward <ma...@mattwoodward.com>:
> myJSONString = '{"foo":"bar", "baz":"<a href="http://foo.com">Here's a
> hyperlink</a>"}';
>
> So if I don't do any sort of escaping the first " in the href will cause the
> JSON to be invalid. But whereas I can change " to &quot; in regular text
> (e.g. "Here's a quote" can become &quot;Here's a quote&quot; and still be
> fine), that won't really work with hyperlinks since <a href=&quot;
> http://foo.com&quot;> won't work properly I don't believe, or at any rate
> isn't all that desirable.
>
> If I escape the " with \ (using the example above, this becomes <a href=\"
> http://foo.com\">) that lets me get the document into Couch successfully,
> but I have a feeling that isn't going to work if I just want to display that
> field as HTML on a web page (though I have yet to experiment with that to
> see what happens when it's rendered).

I'll just add to what others have already said that no matter what,
why and how you escape, whether manually or automatically, what should
happen is to always escape once what gets in and then always unescape
once what gets out. (I say "once" because it's easy to escape
something twice and unescape once or vice versa and have a problem.)

I say it because you worry that escaped strings might not work in HTML
but it would be true only if you forget to unescape them when you get
them out and you have to do it anyway when you decode JSON.

The correct way to store any text in JSON is to always escape the
double quote, backslash and control characters (" as \", \ as \\ etc.)
and the correct way to decode strings in JSON is to unescape
everything that starts with backslash - see http://json.org/

It's easy to get it wrong so that's why I second Jonathan's advice to
use a ready JSON encoder/decoder library and not worry about it
yourself.

By the way, I'm new on this list so hello to everyone.

Best regards,
Rafał Pocztarski.