Posted to user@nutch.apache.org by nt...@peapod.com on 2008/06/09 22:31:32 UTC

Stripping Carriage Returns & Line Feeds?

Is there an "out of the box" way to get Nutch to remove carriage returns and/or line feeds from content as it parses? In a recent crawl of one of our sites I'm finding \n characters in unexpected places, and I'd like to strip them out. When there's a \n in the middle of quoted text (such as "Some \n String"), the quotation marks come out in a browser as ?. As far as I can tell, the source content is just formatted strangely. I'm guessing this is a common problem and I'm just missing something?
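
To make the request concrete, the cleanup I have in mind looks roughly like the sketch below. This is plain Java, not a Nutch API: the class name LineBreakStripper and the method stripLineBreaks are made up for illustration, and where such a step would hook into Nutch's parsing is exactly the open question.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collapses line breaks in extracted text.
public class LineBreakStripper {

    // A run of whitespace that contains at least one CR or LF.
    private static final Pattern LINE_BREAKS = Pattern.compile("\\s*[\\r\\n]+\\s*");

    /** Replaces each whitespace run containing a line break with a single space. */
    public static String stripLineBreaks(String text) {
        if (text == null) {
            return null;
        }
        return LINE_BREAKS.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // The example from above: a \n in the middle of quoted text.
        System.out.println(stripLineBreaks("\"Some \n String\""));
    }
}
```

Replacing each break with a single space, rather than deleting it outright, keeps words on either side of the break from being glued together.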