
Posted to user@spark.apache.org by yh18190 <yh...@gmail.com> on 2015/03/12 17:26:34 UTC

How to consider HTML files in Spark

Hi. I am fascinated by the Spark framework. I am trying to use PySpark +
BeautifulSoup to parse HTML files, but I am having trouble loading an HTML
file into BeautifulSoup.
Example:

from bs4 import BeautifulSoup

filepath = "file:///path to html directory"

def readhtml(inputhtml):
    # load the HTML content
    return BeautifulSoup(inputhtml)

loaddata = sc.textFile(filepath).map(readhtml)

The problem is that Spark treats the loaded file as a text file and processes
it line by line. I want to load the entire HTML content into BeautifulSoup
for further processing.
Does anyone know how to take a whole HTML file as input instead of
processing it line by line?




---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: How to consider HTML files in Spark

Posted by Davies Liu <da...@databricks.com>.

sc.wholeTextFiles() is what you need.

http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles
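
A minimal sketch of how that could look, assuming the bs4 package is
installed on the workers. wholeTextFiles returns one (path, content) pair
per file, so the mapped function receives the whole document rather than a
single line. The names read_html and load_titles are made up for this
example, and extracting the <title> is just a stand-in for whatever
processing you need.

```python
from bs4 import BeautifulSoup


def read_html(record):
    """Parse one (path, content) pair as produced by wholeTextFiles."""
    path, html = record
    # "html.parser" is the parser bundled with Python; this choice is an
    # assumption, lxml or html5lib can be used instead if installed.
    soup = BeautifulSoup(html, "html.parser")
    # Return plain Python values rather than the soup object itself.
    title = soup.title.get_text() if soup.title else None
    return path, title


def load_titles(sc, html_dir):
    """Return an RDD of (path, title) pairs, one per file in html_dir."""
    # wholeTextFiles yields each file as a single record, unlike textFile,
    # which yields one record per line.
    return sc.wholeTextFiles(html_dir).map(read_html)
```

With an existing SparkContext sc, load_titles(sc, filepath).collect() would
give the list of pairs. Each file is read into memory as one string, so this
fits best when the individual HTML files are reasonably small.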

