RandomForest

class pyspark.mllib.tree.RandomForest

Learning algorithm for a random forest model for classification or regression.

New in version 1.2.0.

Methods

trainClassifier(data, numClasses, …[, …])

Train a random forest model for binary or multiclass classification.

trainRegressor(data, …[, …])

Train a random forest model for regression.

Attributes

supportedFeatureSubsetStrategies

Methods Documentation

classmethod trainClassifier(data, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None)

Train a random forest model for binary or multiclass classification.

New in version 1.2.0.

Parameters:
data : pyspark.RDD

Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, …, numClasses-1}.

numClasses : int

Number of classes for classification.

categoricalFeaturesInfo : dict

Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, …, k-1}.

numTrees : int

Number of trees in the random forest.

featureSubsetStrategy : str, optional

Number of features to consider for splits at each node. Supported values: “auto”, “all”, “sqrt”, “log2”, “onethird”. If “auto” is set, this parameter is set based on numTrees:

  • if numTrees == 1, set to “all”;

  • if numTrees > 1 (forest), set to “sqrt” for classification.

(default: “auto”)

impurity : str, optional

Criterion used for information gain calculation. Supported values: “gini” or “entropy”. (default: “gini”)

maxDepth : int, optional

Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 4)

maxBins : int, optional

Maximum number of bins used for splitting features. (default: 32)

seed : int, optional

Random seed for bootstrapping and choosing feature subsets. Set as None to generate seed based on system time. (default: None)
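
The categoricalFeaturesInfo map described above can be assembled from data before training. The following is a minimal sketch in plain Python, not part of the pyspark API: the helper name build_categorical_features_info is hypothetical, and it assumes every categorical column is already encoded as the integers 0, 1, …, k-1.

```python
def build_categorical_features_info(rows, categorical_indices):
    """Return {feature index: arity} for the given categorical columns.

    Hypothetical helper: assumes each categorical column is already encoded
    as the integers 0, 1, ..., k-1, so the arity k is (largest value + 1).
    """
    info = {}
    for n in categorical_indices:
        values = {int(row[n]) for row in rows}
        if min(values) < 0:
            raise ValueError("feature %d has a negative category" % n)
        info[n] = max(values) + 1
    return info


# Feature 0 is continuous; features 1 and 2 are categorical.
rows = [
    [0.5, 0.0, 2.0],
    [1.5, 1.0, 0.0],
    [2.5, 1.0, 1.0],
]
categorical_features_info = build_categorical_features_info(rows, [1, 2])
print(categorical_features_info)  # {1: 2, 2: 3}
```

The resulting dict has the (n -> k) shape that categoricalFeaturesInfo expects. Inferring the arity from a sample undercounts when the highest category is absent from that sample, so a known arity should be preferred where one is available.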

Returns:
RandomForestModel

Trained model that can be used for prediction.
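
To make the two impurity options for classification concrete, the sketch below computes Gini impurity and entropy with their textbook definitions and uses them to score one candidate split. It is an illustration in plain Python under the assumption that information gain is the parent impurity minus the size-weighted impurity of the children; the function names are hypothetical and the code is not taken from the library.

```python
import math


def class_probabilities(labels):
    """Return the fraction of labels that fall into each observed class."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = float(len(labels))
    return [c / total for c in counts.values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_i ** 2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Entropy in bits: sum(-p_i * log2(p_i))."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))


def information_gain(impurity, parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = float(len(parent))
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


parent = [0.0, 0.0, 1.0, 1.0]
left, right = [0.0, 0.0], [1.0, 1.0]  # a split that separates the classes
gini_gain = information_gain(gini, parent, left, right)
entropy_gain = information_gain(entropy, parent, left, right)
print(gini_gain, entropy_gain)
```

Both criteria give a perfectly separating split the largest possible gain for a balanced two-class node; they differ only in how strongly they penalize mixed nodes.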

Examples

>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import RandomForest
>>>
>>> data = [
...     LabeledPoint(0.0, [0.0]),
...     LabeledPoint(0.0, [1.0]),
...     LabeledPoint(1.0, [2.0]),
...     LabeledPoint(1.0, [3.0])
... ]
>>> model = RandomForest.trainClassifier(sc.parallelize(data), 2, {}, 3, seed=42)
>>> model.numTrees()
3
>>> model.totalNumNodes()
7
>>> print(model)
TreeEnsembleModel classifier with 3 trees

>>> print(model.toDebugString())
TreeEnsembleModel classifier with 3 trees

  Tree 0:
    Predict: 1.0
  Tree 1:
    If (feature 0 <= 1.5)
     Predict: 0.0
    Else (feature 0 > 1.5)
     Predict: 1.0
  Tree 2:
    If (feature 0 <= 1.5)
     Predict: 0.0
    Else (feature 0 > 1.5)
     Predict: 1.0

>>> model.predict([2.0])
1.0
>>> model.predict([0.0])
0.0
>>> rdd = sc.parallelize([[3.0], [1.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]
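
The predictions above can be traced by hand from the debug string. The sketch below re-creates the three printed trees as plain Python functions and combines them by an unweighted majority vote; that the forest combines trees this way is an assumption of this sketch, and tie-breaking is not modeled.

```python
from collections import Counter


def tree_0(features):
    # Tree 0 is a single leaf that always predicts 1.0.
    return 1.0


def tree_1(features):
    # Tree 1 splits on feature 0 at 1.5.
    return 0.0 if features[0] <= 1.5 else 1.0


def tree_2(features):
    # Tree 2 has the same structure as Tree 1.
    return 0.0 if features[0] <= 1.5 else 1.0


def majority_vote(trees, features):
    """Return the label predicted by the largest number of trees."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


forest = [tree_0, tree_1, tree_2]
predictions = [majority_vote(forest, x) for x in ([2.0], [0.0], [3.0], [1.0])]
print(predictions)  # [1.0, 0.0, 1.0, 0.0]
```

For the input [0.0], Tree 0 votes 1.0 while Trees 1 and 2 vote 0.0, so the vote yields 0.0, which agrees with the output shown in the example.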

classmethod trainRegressor(data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None)

Train a random forest model for regression.

New in version 1.2.0.

Parameters:
data : pyspark.RDD

Training dataset: RDD of LabeledPoint. Labels are real numbers.

categoricalFeaturesInfo : dict

Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, …, k-1}.

numTrees : int

Number of trees in the random forest.

featureSubsetStrategy : str, optional

Number of features to consider for splits at each node. Supported values: “auto”, “all”, “sqrt”, “log2”, “onethird”. If “auto” is set, this parameter is set based on numTrees:

  • if numTrees == 1, set to “all”;

  • if numTrees > 1 (forest) set to “onethird” for regression.

(default: “auto”)

impurity : str, optional

Criterion used for information gain calculation. The only supported value for regression is “variance”. (default: “variance”)

maxDepth : int, optional

Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 4)

maxBins : int, optional

Maximum number of bins used for splitting features. (default: 32)

seed : int, optional

Random seed for bootstrapping and choosing feature subsets. Set as None to generate seed based on system time. (default: None)
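
For regression, the “variance” criterion named in the impurity parameter above can be illustrated as follows. This is a sketch in plain Python using the population variance of the labels in a node, and it assumes that a split is scored by the size-weighted reduction in variance; the function names are hypothetical.

```python
def variance(labels):
    """Population variance of the labels that reach a node."""
    mean = sum(labels) / float(len(labels))
    return sum((y - mean) ** 2 for y in labels) / len(labels)


def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = float(len(parent))
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted


parent = [0.0, 1.0, 0.0, 1.0]
left, right = [0.0, 0.0], [1.0, 1.0]  # a split that groups equal labels
reduction = variance_reduction(parent, left, right)
print(variance(parent), reduction)  # 0.25 0.25
```

A split whose children each hold identical labels removes all of the parent's variance, so the reduction equals the parent variance.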

Returns:
RandomForestModel

Trained model that can be used for prediction.
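
The maxDepth examples in the parameter list (depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes) generalize to simple upper bounds for binary trees. The sketch below computes those bounds in plain Python; the function names are hypothetical, and actual trees are often smaller because nodes stop splitting early.

```python
def max_leaves(max_depth):
    """Upper bound on leaf nodes in a binary tree of the given depth."""
    return 2 ** max_depth


def max_nodes(max_depth):
    """Upper bound on total nodes (internal + leaf) in a binary tree."""
    return 2 ** (max_depth + 1) - 1


def max_forest_nodes(num_trees, max_depth):
    """Upper bound on the total node count of a forest."""
    return num_trees * max_nodes(max_depth)


for depth in (0, 1, 4):
    print(depth, max_leaves(depth), max_nodes(depth))
# 0 1 1
# 1 2 3
# 4 16 31
```

With the default maxDepth=4, each tree has at most 31 nodes, so the total node count of a forest cannot exceed 31 times numTrees.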

Examples

>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import RandomForest
>>> from pyspark.mllib.linalg import SparseVector
>>>
>>> sparse_data = [
...     LabeledPoint(0.0, SparseVector(2, {0: 1.0})),
...     LabeledPoint(1.0, SparseVector(2, {1: 1.0})),
...     LabeledPoint(0.0, SparseVector(2, {0: 1.0})),
...     LabeledPoint(1.0, SparseVector(2, {1: 2.0}))
... ]
>>>
>>> model = RandomForest.trainRegressor(sc.parallelize(sparse_data), {}, 2, seed=42)
>>> model.numTrees()
2
>>> model.totalNumNodes()
4
>>> model.predict(SparseVector(2, {1: 1.0}))
1.0
>>> model.predict(SparseVector(2, {0: 1.0}))
0.5
>>> rdd = sc.parallelize([[0.0, 1.0], [1.0, 0.0]])
>>> model.predict(rdd).collect()
[1.0, 0.5]

Attributes Documentation

supportedFeatureSubsetStrategies = ('auto', 'all', 'sqrt', 'log2', 'onethird')
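
The strategies listed in this attribute can be illustrated with a small plain-Python sketch. resolve_strategy follows the “auto” rules given in the parameter descriptions of trainClassifier and trainRegressor. features_per_node rests on an assumption that is not stated on this page: fractional feature counts are rounded up and never drop below 1. Both function names are hypothetical.

```python
import math

# Mirrors the attribute documented above.
SUPPORTED_STRATEGIES = ('auto', 'all', 'sqrt', 'log2', 'onethird')


def resolve_strategy(strategy, num_trees, is_classification):
    """Resolve 'auto' as described in the parameter documentation."""
    if strategy not in SUPPORTED_STRATEGIES:
        raise ValueError("unsupported featureSubsetStrategy: %r" % (strategy,))
    if strategy != 'auto':
        return strategy
    if num_trees == 1:
        return 'all'
    return 'sqrt' if is_classification else 'onethird'


def features_per_node(strategy, num_features):
    """Number of candidate features per split for a resolved strategy.

    Assumption: fractional counts are rounded up and are at least 1.
    """
    if strategy == 'all':
        return num_features
    if strategy == 'sqrt':
        return max(1, math.ceil(math.sqrt(num_features)))
    if strategy == 'log2':
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == 'onethird':
        return max(1, math.ceil(num_features / 3.0))
    raise ValueError("strategy must be resolved first: %r" % (strategy,))


resolved = resolve_strategy('auto', num_trees=3, is_classification=True)
print(resolved, features_per_node(resolved, 10))
```

Under these assumptions, a three-tree classifier trained on 10 features resolves “auto” to “sqrt” and considers 4 features at each split.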