Sometimes you want to convert a list of Java objects to an RDD, and you may already know that calling sc.parallelize(list) does that. The tricky point is that in Java, sc is not a SparkContext. You need a JavaSparkContext instead.
The Spark documentation may give the impression that you can parallelize any collection, but in Java, JavaSparkContext.parallelize only accepts a java.util.List. If your data is in another collection type, such as a Set, copy it into a List first (for example with new ArrayList<>(set)).
The relationship between the two is that JavaSparkContext is a wrapper around SparkContext. Many operations are available directly on JavaSparkContext, and when an API requires a SparkContext you can get the underlying object with jsc.sc(). A typical case is writing a Spark ML model to a filesystem: the save method requires a SparkContext, not a JavaSparkContext, so you pass jsc.sc() to it.
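The original snippet for this step is not available here, so the following is only a minimal sketch of the idea. The class name ParallelizeExample, the helper countParallelized, the app name, and the local[*] master are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelizeExample {

    // Hypothetical helper: turns a local Java List into a JavaRDD
    // and returns the number of elements in it.
    static long countParallelized(List<String> items) {
        SparkConf conf = new SparkConf()
                .setAppName("parallelize-example") // assumed app name
                .setMaster("local[*]");            // assumed: run locally
        // JavaSparkContext is Closeable, so try-with-resources stops it.
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // parallelize takes a java.util.List, not an arbitrary collection.
            JavaRDD<String> rdd = jsc.parallelize(items);
            return rdd.count();
        }
    }

    public static void main(String[] args) {
        System.out.println(countParallelized(Arrays.asList("a", "b", "c")));
    }
}
```

In a real job you would usually reuse one JavaSparkContext for the whole application rather than create one per call as this sketch does.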
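The original code for this step is not available here either. As a hedged sketch, the example below uses MLlib's KMeansModel, whose save and load methods take a SparkContext. The choice of KMeans, the sample points, the class name SaveModelExample, the helper saveAndReload, and the output path are assumptions for illustration, not the model from the original post.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class SaveModelExample {

    // Hypothetical helper: trains a tiny KMeans model, saves it to `path`,
    // loads it back, and returns the number of cluster centers it holds.
    // `path` must not exist yet, because save refuses to overwrite.
    static int saveAndReload(String path) {
        SparkConf conf = new SparkConf()
                .setAppName("save-model-example") // assumed app name
                .setMaster("local[2]");           // assumed: run locally
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // Made-up sample data: two well-separated groups of points.
            List<Vector> points = Arrays.asList(
                    Vectors.dense(new double[] {0.0, 0.0}),
                    Vectors.dense(new double[] {0.1, 0.1}),
                    Vectors.dense(new double[] {9.0, 9.0}),
                    Vectors.dense(new double[] {9.1, 9.1}));
            JavaRDD<Vector> data = jsc.parallelize(points);

            // KMeans.train expects a Scala RDD, hence data.rdd().
            KMeansModel model = KMeans.train(data.rdd(), 2, 20);

            // save and load want a SparkContext, so unwrap it with jsc.sc().
            model.save(jsc.sc(), path);
            KMeansModel reloaded = KMeansModel.load(jsc.sc(), path);
            return reloaded.clusterCenters().length;
        }
    }

    public static void main(String[] args) {
        // Assumed output location under the system temp directory.
        String path = System.getProperty("java.io.tmpdir")
                + "/kmeans-model-" + System.nanoTime();
        System.out.println(saveAndReload(path));
    }
}
```

The same jsc.sc() call works for any other API that asks for a SparkContext instead of a JavaSparkContext.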