GBTRegressionModel (Spark 3.3.0 JavaDoc)

Object
- org.apache.spark.ml.PipelineStage
  - org.apache.spark.ml.Transformer
    - org.apache.spark.ml.Model<M>
      - org.apache.spark.ml.PredictionModel<FeaturesType,M>
        - org.apache.spark.ml.regression.RegressionModel<Vector,GBTRegressionModel>
          - org.apache.spark.ml.regression.GBTRegressionModel

All Implemented Interfaces:

java.io.Serializable, org.apache.spark.internal.Logging, Params, HasCheckpointInterval, HasFeaturesCol, HasLabelCol, HasMaxIter, HasPredictionCol, HasSeed, HasStepSize, HasValidationIndicatorCol, HasWeightCol, PredictorParams, DecisionTreeParams, GBTParams, GBTRegressorParams, HasVarianceImpurity, TreeEnsembleModel<DecisionTreeRegressionModel>, TreeEnsembleParams, TreeEnsembleRegressorParams, TreeRegressorParams, Identifiable, MLWritable
```
public class GBTRegressionModel
extends RegressionModel<Vector,GBTRegressionModel>
implements GBTRegressorParams, TreeEnsembleModel<DecisionTreeRegressionModel>, MLWritable, scala.Serializable
```
Gradient-Boosted Trees (GBTs) model for regression. It supports both continuous and categorical features.

Parameters:

_trees - Decision trees in the ensemble.

_treeWeights - Weights for the decision trees in the ensemble.

See Also:

Serialized Form

Nested Class Summary
- Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
  org.apache.spark.internal.Logging.SparkShellLoggingFilter

Constructor Summary

Constructors
Constructor and Description
`GBTRegressionModel(String uid, DecisionTreeRegressionModel[] _trees, double[] _treeWeights)` Construct a GBTRegressionModel

Method Summary

Modifier and Type	Method and Description
`BooleanParam`	`cacheNodeIds()` If false, the algorithm will pass trees to executors to match instances with nodes.
`IntParam`	`checkpointInterval()` Param for setting the checkpoint interval (>= 1) or disabling checkpointing (-1).
`GBTRegressionModel`	`copy(ParamMap extra)` Creates a copy of this instance with the same UID and some extra params.
`double[]`	`evaluateEachIteration(Dataset<?> dataset, String loss)` Method to compute error or loss for every iteration of gradient boosting.
`Vector`	`featureImportances()`
`Param<String>`	`featureSubsetStrategy()` The number of features to consider for splits at each tree node.
`int`	`getNumTrees()` Number of trees in ensemble
`Param<String>`	`impurity()` Criterion used for information gain calculation (case-insensitive).
`Param<String>`	`leafCol()` Leaf indices column name.
`static GBTRegressionModel`	`load(String path)`
`Param<String>`	`lossType()` Loss function which GBT tries to minimize.
`IntParam`	`maxBins()` Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node.
`IntParam`	`maxDepth()` Maximum depth of the tree (nonnegative).
`IntParam`	`maxIter()` Param for maximum number of iterations (>= 0).
`IntParam`	`maxMemoryInMB()` Maximum memory in MB allocated to histogram aggregation.
`DoubleParam`	`minInfoGain()` Minimum information gain for a split to be considered at a tree node.
`IntParam`	`minInstancesPerNode()` Minimum number of instances each child must have after split.
`DoubleParam`	`minWeightFractionPerNode()` Minimum fraction of the weighted sample count that each child must have after split.
`int`	`numFeatures()` Returns the number of features the model was trained on.
`double`	`predict(Vector features)` Predict label for the given features.
`static MLReader<GBTRegressionModel>`	`read()`
`LongParam`	`seed()` Param for random seed.
`DoubleParam`	`stepSize()` Param for Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.
`DoubleParam`	`subsamplingRate()` Fraction of the training data used for learning each decision tree, in range (0, 1].
`String`	`toString()` Summary of the model
`int`	`totalNumNodes()` Total number of nodes, summed over all trees in the ensemble.
`Dataset<Row>`	`transform(Dataset<?> dataset)` Transforms dataset by reading from `featuresCol`, calling `predict`, and storing the predictions as a new column `predictionCol`.
`StructType`	`transformSchema(StructType schema)` Check transform validity and derive the output schema from the input schema.
`DecisionTreeRegressionModel[]`	`trees()` Trees in this ensemble.
`double[]`	`treeWeights()` Weights for each tree, zippable with `trees`
`String`	`uid()` An immutable unique ID for the object and its derivatives.
`Param<String>`	`validationIndicatorCol()` Param for name of the column that indicates whether each row is for training or for validation.
`DoubleParam`	`validationTol()` Threshold for stopping early when fit with validation is used.
`Param<String>`	`weightCol()` Param for weight column name.
`MLWriter`	`write()` Returns an `MLWriter` instance for this ML instance.

Methods inherited from class org.apache.spark.ml.PredictionModel
featuresCol, labelCol, predictionCol, setFeaturesCol, setPredictionCol

Methods inherited from class org.apache.spark.ml.Model
hasParent, parent, setParent

Methods inherited from class org.apache.spark.ml.Transformer
transform, transform, transform

Methods inherited from class org.apache.spark.ml.PipelineStage
params

Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface org.apache.spark.ml.tree.GBTRegressorParams
convertToOldLossType, getLossType, getOldLossType

Methods inherited from interface org.apache.spark.ml.tree.GBTParams
getOldBoostingStrategy, getValidationTol

Methods inherited from interface org.apache.spark.ml.param.shared.HasMaxIter
getMaxIter

Methods inherited from interface org.apache.spark.ml.param.shared.HasStepSize
getStepSize

Methods inherited from interface org.apache.spark.ml.param.shared.HasValidationIndicatorCol
getValidationIndicatorCol

Methods inherited from interface org.apache.spark.ml.tree.TreeEnsembleRegressorParams
validateAndTransformSchema

Methods inherited from interface org.apache.spark.ml.tree.TreeEnsembleParams
getFeatureSubsetStrategy, getOldStrategy, getSubsamplingRate

Methods inherited from interface org.apache.spark.ml.tree.DecisionTreeParams
getCacheNodeIds, getLeafCol, getMaxBins, getMaxDepth, getMaxMemoryInMB, getMinInfoGain, getMinInstancesPerNode, getMinWeightFractionPerNode, getOldStrategy, setLeafCol

Methods inherited from interface org.apache.spark.ml.PredictorParams
extractInstances, extractInstances

Methods inherited from interface org.apache.spark.ml.param.shared.HasLabelCol
getLabelCol, labelCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasFeaturesCol
featuresCol, getFeaturesCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasPredictionCol
getPredictionCol, predictionCol

Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn

Methods inherited from interface org.apache.spark.ml.param.shared.HasCheckpointInterval
getCheckpointInterval

Methods inherited from interface org.apache.spark.ml.param.shared.HasSeed
getSeed

Methods inherited from interface org.apache.spark.ml.param.shared.HasWeightCol
getWeightCol

Methods inherited from interface org.apache.spark.ml.tree.HasVarianceImpurity
getImpurity, getOldImpurity

Methods inherited from interface org.apache.spark.ml.tree.TreeEnsembleModel
getLeafField, javaTreeWeights, predictLeaf, toDebugString

Methods inherited from interface org.apache.spark.ml.util.MLWritable
save

Methods inherited from interface org.apache.spark.internal.Logging
$init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize

- Constructor Detail
  - GBTRegressionModel
```
public GBTRegressionModel(String uid,
                          DecisionTreeRegressionModel[] _trees,
                          double[] _treeWeights)
```
    Construct a GBTRegressionModel
    
    Parameters:
    
    _trees - Decision trees in the ensemble.
    
    _treeWeights - Weights for the decision trees in the ensemble.
    
    uid - (undocumented)
- Method Detail
  - read
```
public static MLReader<GBTRegressionModel> read()
```
  - load
```
public static GBTRegressionModel load(String path)
```
  - totalNumNodes
```
public int totalNumNodes()
```
    Description copied from interface: TreeEnsembleModel
    
    Total number of nodes, summed over all trees in the ensemble.
    
    Specified by:
    
    totalNumNodes in interface TreeEnsembleModel<DecisionTreeRegressionModel>
  - lossType
```
public Param<String> lossType()
```
    Description copied from interface: GBTRegressorParams
    
    Loss function which GBT tries to minimize (case-insensitive). Supported: "squared" (L2) and "absolute" (L1). (default = squared)
    
    Specified by:
    
    lossType in interface GBTRegressorParams
    
    Returns:
    
    (undocumented)
  - impurity
```
public final Param<String> impurity()
```
    Description copied from interface: HasVarianceImpurity
    
    Criterion used for information gain calculation (case-insensitive). This impurity type is used in DecisionTreeRegressor, RandomForestRegressor, GBTRegressor and GBTClassifier (since GBTClassificationModel is internally composed of DecisionTreeRegressionModels). Supported: "variance". (default = variance)
    
    Specified by:
    
    impurity in interface HasVarianceImpurity
    
    Returns:
    
    (undocumented)
  - validationTol
```
public final DoubleParam validationTol()
```
    Description copied from interface: GBTParams
    
    Threshold for stopping early when fit with validation is used. (This parameter is ignored when fit without validation is used.) The decision to stop early is based on this logic: if the current loss on the validation set is greater than 0.01, the diff of validation error is compared to a relative tolerance, which is validationTol * (current loss on the validation set). If the current loss on the validation set is less than or equal to 0.01, the diff of validation error is compared to an absolute tolerance, which is validationTol * 0.01.
    
    Specified by:
    
    validationTol in interface GBTParams
    
    Returns:
    
    (undocumented)
    
    See Also:
    
    validationIndicatorCol
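    
    The two tolerance cases described for validationTol can be sketched in plain Java. `ValidationTolSketch` and its methods are hypothetical names, not part of the Spark API, and the stop condition (improvement over the best loss so far falling below the tolerance) is an assumption about how "the diff of validation error" is measured.
```java
// Hypothetical helper for illustration; not part of the Spark API.
final class ValidationTolSketch {
    // Loss threshold that separates the relative and absolute cases (from the description).
    static final double LOSS_FLOOR = 0.01;

    /** Relative tolerance when currentLoss > 0.01, absolute tolerance otherwise. */
    static double tolerance(double currentLoss, double validationTol) {
        return validationTol * Math.max(currentLoss, LOSS_FLOOR);
    }

    /**
     * Assumption: boosting stops once the improvement over the best
     * validation loss seen so far is smaller than the tolerance.
     */
    static boolean shouldStop(double bestLoss, double currentLoss, double validationTol) {
        double improvement = bestLoss - currentLoss;
        return improvement < tolerance(currentLoss, validationTol);
    }
}
```
    With validationTol = 0.01, a drop in validation loss from 1.0 to 0.999 is below the relative tolerance (about 0.00999) and would stop training under this sketch, while a drop from 1.0 to 0.5 would not.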
  - stepSize
```
public final DoubleParam stepSize()
```
    Description copied from interface: GBTParams
    
    Param for Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator. (default = 0.1)
    
    Specified by:
    
    stepSize in interface HasStepSize
    
    Specified by:
    
    stepSize in interface GBTParams
    
    Returns:
    
    (undocumented)
  - validationIndicatorCol
```
public final Param<String> validationIndicatorCol()
```
    Description copied from interface: HasValidationIndicatorCol
    
    Param for name of the column that indicates whether each row is for training or for validation. False indicates training; true indicates validation.
    
    Specified by:
    
    validationIndicatorCol in interface HasValidationIndicatorCol
    
    Returns:
    
    (undocumented)
  - maxIter
```
public final IntParam maxIter()
```
    Description copied from interface: HasMaxIter
    
    Param for maximum number of iterations (>= 0).
    
    Specified by:
    
    maxIter in interface HasMaxIter
    
    Returns:
    
    (undocumented)
  - subsamplingRate
```
public final DoubleParam subsamplingRate()
```
    Description copied from interface: TreeEnsembleParams
    
    Fraction of the training data used for learning each decision tree, in range (0, 1]. (default = 1.0)
    
    Specified by:
    
    subsamplingRate in interface TreeEnsembleParams
    
    Returns:
    
    (undocumented)
  - featureSubsetStrategy
```
public final Param<String> featureSubsetStrategy()
```
    Description copied from interface: TreeEnsembleParams
    
    The number of features to consider for splits at each tree node. Supported options:
    - "auto": Choose automatically for task: If numTrees == 1, set to "all". If numTrees is greater than 1 (forest), set to "sqrt" for classification and to "onethird" for regression.
    - "all": use all features
    - "onethird": use 1/3 of the features
    - "sqrt": use sqrt(number of features)
    - "log2": use log2(number of features)
    - "n": when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features.
    
    (default = "auto")
    
    These various settings are based on the following references:
    - log2: tested in Breiman (2001)
    - sqrt: recommended by Breiman manual for random forests
    - The defaults of sqrt (classification) and onethird (regression) match the R randomForest package.
    
    Specified by:
    
    featureSubsetStrategy in interface TreeEnsembleParams
    
    Returns:
    
    (undocumented)
    
    See Also:
    
    Breiman (2001), Breiman manual for random forests
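    
    As a rough illustration of the featureSubsetStrategy options, the sketch below maps a strategy string to a feature count per node. `FeatureSubsetSketch` is a hypothetical helper, not Spark code; rounding up with `Math.ceil`, flooring "log2" at 1, and capping an absolute count at the number of features are assumptions of this sketch rather than statements from the description.
```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the Spark API.
final class FeatureSubsetSketch {
    /**
     * Resolves a featureSubsetStrategy value to a number of features per node.
     * Rounding up and capping at numFeatures are assumptions of this sketch.
     */
    static int featuresPerNode(String strategy, int numFeatures,
                               int numTrees, boolean classification) {
        String s = strategy.toLowerCase(Locale.ROOT);
        if (s.equals("auto")) {
            // One tree: all features. Forest: sqrt (classification) or onethird (regression).
            s = (numTrees == 1) ? "all" : (classification ? "sqrt" : "onethird");
        }
        switch (s) {
            case "all":
                return numFeatures;
            case "onethird":
                return (int) Math.ceil(numFeatures / 3.0);
            case "sqrt":
                return (int) Math.ceil(Math.sqrt(numFeatures));
            case "log2":
                return Math.max(1, (int) Math.ceil(Math.log(numFeatures) / Math.log(2)));
            default:
                // Numeric form "n"; parseDouble throws NumberFormatException otherwise.
                double n = Double.parseDouble(s);
                if (n > 0.0 && n <= 1.0) {
                    return (int) Math.ceil(n * numFeatures); // fraction of the features
                }
                if (n > 1.0) {
                    return (int) Math.min(n, numFeatures);   // absolute number of features
                }
                throw new IllegalArgumentException(
                        "Unsupported featureSubsetStrategy: " + strategy);
        }
    }
}
```
    For example, with 10 features, "0.5" resolves to 5 features and "4" resolves to 4 features under these assumptions.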
  - leafCol
```
public final Param<String> leafCol()
```
    Description copied from interface: DecisionTreeParams
    
    Leaf indices column name. Predicted leaf index of each instance in each tree by preorder. (default = "")
    
    Specified by:
    
    leafCol in interface DecisionTreeParams
    
    Returns:
    
    (undocumented)
  - maxDepth
```
public final IntParam maxDepth()
```
    Description copied from interface: DecisionTreeParams
    
    Maximum depth of the tree (nonnegative). E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default = 5)
    
    Specified by:
    
    maxDepth in interface DecisionTreeParams
    
    Returns:
    
    (undocumented)
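    
    The depth examples above generalize to simple upper bounds on tree size. The sketch below assumes binary splits, which is what the two examples imply; `TreeSizeSketch` is a hypothetical name used only for illustration.
```java
// Hypothetical helper for illustration; not part of the Spark API.
final class TreeSizeSketch {
    private static void check(int maxDepth) {
        if (maxDepth < 0 || maxDepth > 62) {
            throw new IllegalArgumentException("maxDepth out of range for this sketch: " + maxDepth);
        }
    }

    /** Upper bound on leaf nodes: a full binary tree of this depth has 2^depth leaves. */
    static long maxLeaves(int maxDepth) {
        check(maxDepth);
        return 1L << maxDepth;
    }

    /** Upper bound on internal (splitting) nodes: 2^depth - 1. */
    static long maxInternalNodes(int maxDepth) {
        check(maxDepth);
        return (1L << maxDepth) - 1;
    }

    /** Upper bound on all nodes: 2^(depth + 1) - 1. */
    static long maxNodes(int maxDepth) {
        return maxLeaves(maxDepth) + maxInternalNodes(maxDepth);
    }
}
```
    Depth 0 gives 1 leaf and no internal nodes, depth 1 gives 1 internal node and 2 leaves, and the default depth of 5 allows at most 63 nodes per tree under this bound.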
  - maxBins
```
public final IntParam maxBins()
```
    Description copied from interface: DecisionTreeParams
    
    Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be at least 2 and at least number of categories in any categorical feature. (default = 32)
    
    Specified by:
    
    maxBins in interface DecisionTreeParams
    
    Returns:
    
    (undocumented)
  - minInstancesPerNode
```
public final IntParam minInstancesPerNode()
```
    Description copied from interface: DecisionTreeParams
    
    Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Must be at least 1. (default = 1)
    
    Specified by:
    
    minInstancesPerNode in interface DecisionTreeParams
    
    Returns:
    
    (undocumented)
  - minWeightFractionPerNode
```
public final DoubleParam minWeightFractionPerNode()
```
    Description copied from interface: DecisionTreeParams
    
    Minimum fraction of the weighted sample count that each child must have after split. If a split causes the fraction of the total weight in the left or right child to be less than minWeightFractionPerNode, the split will be discarded as invalid. Should be in the interval [0.0, 0.5). (default = 0.0)
    
    Specified by:
    
    minWeightFractionPerNode in interface DecisionTreeParams
    
    Returns:
    
    (undocumented)
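    
    The split checks described for minInstancesPerNode and minWeightFractionPerNode can be combined into one predicate. This is a minimal sketch with hypothetical names; in particular, `totalWeight` is assumed to be the total weight against which each child's fraction is measured, and it is supplied by the caller.
```java
// Hypothetical helper for illustration; not part of the Spark API.
final class SplitValiditySketch {
    /**
     * Returns true when both children satisfy the instance-count and
     * weight-fraction minimums; otherwise the split would be discarded.
     */
    static boolean isValidSplit(long leftCount, long rightCount,
                                double leftWeight, double rightWeight,
                                double totalWeight,
                                int minInstancesPerNode,
                                double minWeightFractionPerNode) {
        // Each child must hold at least minInstancesPerNode instances.
        boolean enoughInstances =
                leftCount >= minInstancesPerNode && rightCount >= minInstancesPerNode;
        // Each child must hold at least the given fraction of the total weight.
        boolean enoughWeight =
                leftWeight / totalWeight >= minWeightFractionPerNode
                        && rightWeight / totalWeight >= minWeightFractionPerNode;
        return enoughInstances && enoughWeight;
    }
}
```
    With the defaults (minInstancesPerNode = 1, minWeightFractionPerNode = 0.0), any split that leaves at least one instance on each side passes this check.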
  - minInfoGain
```
public final DoubleParam minInfoGain()
```
    Description copied from interface: DecisionTreeParams
    
    Minimum information gain for a split to be considered at a tree node. Should be at least 0.0. (default = 0.0)
    
    Specified by:
    
    minInfoGain in interface DecisionTreeParams
    
    Returns:
    
    (undocumented)
  - maxMemoryInMB
```
public final IntParam maxMemoryInMB()
```
    Description copied from interface: DecisionTreeParams
    
    Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size. (default = 256 MB)
    
    Specified by:
    
    maxMemoryInMB in interface DecisionTreeParams
    
    Returns:
    
    (undocumented)
  - cacheNodeIds
```
public final BooleanParam cacheNodeIds()
```
    Description copied from interface: DecisionTreeParams
    
    If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often the cache should be checkpointed, or disable checkpointing, by setting checkpointInterval. (default = false)
    
    Specified by:
    
    cacheNodeIds in interface DecisionTreeParams
    
    Returns:
    
    (undocumented)
  - weightCol
```
public final Param<String> weightCol()
```
    Description copied from interface: HasWeightCol
    
    Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.
    
    Specified by:
    
    weightCol in interface HasWeightCol
    
    Returns:
    
    (undocumented)
  - seed
```
public final LongParam seed()
```
    Description copied from interface: HasSeed
    
    Param for random seed.
    
    Specified by:
    
    seed in interface HasSeed
    
    Returns:
    
    (undocumented)
  - checkpointInterval
```
public final IntParam checkpointInterval()
```
    Description copied from interface: HasCheckpointInterval
    
    Param for setting the checkpoint interval (>= 1) or disabling checkpointing (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext.
    
    Specified by:
    
    checkpointInterval in interface HasCheckpointInterval
    
    Returns:
    
    (undocumented)
  - uid
```
public String uid()
```
    Description copied from interface: Identifiable
    
    An immutable unique ID for the object and its derivatives.
    
    Specified by:
    
    uid in interface Identifiable
    
    Returns:
    
    (undocumented)
  - numFeatures
```
public int numFeatures()
```
    Description copied from class: PredictionModel
    
    Returns the number of features the model was trained on. If unknown, returns -1
    
    Overrides:
    
    numFeatures in class PredictionModel<Vector,GBTRegressionModel>
  - trees
```
public DecisionTreeRegressionModel[] trees()
```
    Description copied from interface: TreeEnsembleModel
    
    Trees in this ensemble. Warning: These have null parent Estimators.
    
    Specified by:
    
    trees in interface TreeEnsembleModel<DecisionTreeRegressionModel>
  - getNumTrees
```
public int getNumTrees()
```
    Number of trees in ensemble
    
    Returns:
    
    (undocumented)
  - treeWeights
```
public double[] treeWeights()
```
    Description copied from interface: TreeEnsembleModel
    
    Weights for each tree, zippable with trees
    
    Specified by:
    
    treeWeights in interface TreeEnsembleModel<DecisionTreeRegressionModel>
  - transformSchema
```
public StructType transformSchema(StructType schema)
```
    Description copied from class: PipelineStage
    
    Check transform validity and derive the output schema from the input schema.
    We check validity for interactions between parameters during transformSchema and raise an exception if any parameter value is invalid. Parameter value checks which do not depend on other parameters are handled by Param.validate().
    Typical implementation should first conduct verification on schema change and parameter validity, including complex parameter interaction checks.
    
    Overrides:
    
    transformSchema in class PredictionModel<Vector,GBTRegressionModel>
    
    Parameters:
    
    schema - (undocumented)
    
    Returns:
    
    (undocumented)
  - transform
```
public Dataset<Row> transform(Dataset<?> dataset)
```
    Description copied from class: PredictionModel
    
    Transforms dataset by reading from featuresCol, calling predict, and storing the predictions as a new column predictionCol.
    
    Overrides:
    
    transform in class PredictionModel<Vector,GBTRegressionModel>
    
    Parameters:
    
    dataset - input dataset
    
    Returns:
    
    transformed dataset with predictionCol of type Double
  - predict
```
public double predict(Vector features)
```
    Description copied from class: PredictionModel
    
    Predict label for the given features. This method is used to implement transform() and output predictionCol.
    
    Specified by:
    
    predict in class PredictionModel<Vector,GBTRegressionModel>
    
    Parameters:
    
    features - (undocumented)
    
    Returns:
    
    (undocumented)
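    
    Since treeWeights is zippable with trees, one way to picture the ensemble prediction is as a weighted sum of the per-tree predictions. The sketch below makes that assumption explicit and uses plain functions as stand-ins for DecisionTreeRegressionModel; `EnsemblePredictSketch` is a hypothetical name, not Spark code.
```java
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical helper for illustration; not part of the Spark API.
final class EnsemblePredictSketch {
    /**
     * Assumption: the ensemble prediction is sum_i treeWeights[i] * tree_i(features).
     * Each "tree" here is any function from a feature array to a prediction.
     */
    static double predict(List<ToDoubleFunction<double[]>> trees,
                          double[] treeWeights,
                          double[] features) {
        if (trees.size() != treeWeights.length) {
            throw new IllegalArgumentException("trees and treeWeights must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < trees.size(); i++) {
            sum += treeWeights[i] * trees.get(i).applyAsDouble(features);
        }
        return sum;
    }
}
```
    For instance, two stand-in trees predicting 2.0 and 10.0 with weights 1.0 and 0.1 would combine to 3.0 under this assumption.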
  - copy
```
public GBTRegressionModel copy(ParamMap extra)
```
    Description copied from interface: Params
    
    Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
    
    Specified by:
    
    copy in interface Params
    
    Specified by:
    
    copy in class Model<GBTRegressionModel>
    
    Parameters:
    
    extra - (undocumented)
    
    Returns:
    
    (undocumented)
  - toString
```
public String toString()
```
    Description copied from interface: TreeEnsembleModel
    
    Summary of the model
    
    Specified by:
    
    toString in interface TreeEnsembleModel<DecisionTreeRegressionModel>
    
    Specified by:
    
    toString in interface Identifiable
    
    Overrides:
    
    toString in class Object
  - featureImportances
```
public Vector featureImportances()
```
  - evaluateEachIteration
```
public double[] evaluateEachIteration(Dataset<?> dataset,
                                      String loss)
```
    Method to compute error or loss for every iteration of gradient boosting.
    
    Parameters:
    
    dataset - Dataset for validation.
    
    loss - The loss function used to compute error. Supported options: squared, absolute
    
    Returns:
    
    (undocumented)
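    
    To make "error or loss for every iteration" concrete, the sketch below computes one loss value per boosting iteration from precomputed per-tree predictions. It is not the Spark implementation: `EvaluateEachIterationSketch` is a hypothetical name, the prediction after iteration i is assumed to be the weighted sum of the first i trees, and the loss is assumed to be an unweighted mean over rows.
```java
// Hypothetical helper for illustration; not part of the Spark API.
final class EvaluateEachIterationSketch {
    /**
     * treePredictions[t][r] is the prediction of tree t for row r.
     * Returns one mean loss per iteration (one iteration per tree).
     */
    static double[] evaluate(double[][] treePredictions, double[] treeWeights,
                             double[] labels, String loss) {
        boolean squared;
        switch (loss) {
            case "squared":  squared = true;  break;
            case "absolute": squared = false; break;
            default: throw new IllegalArgumentException("Unsupported loss: " + loss);
        }
        int numRows = labels.length;
        double[] running = new double[numRows];      // ensemble prediction so far, per row
        double[] result = new double[treePredictions.length];
        for (int t = 0; t < treePredictions.length; t++) {
            double total = 0.0;
            for (int r = 0; r < numRows; r++) {
                running[r] += treeWeights[t] * treePredictions[t][r];
                double diff = labels[r] - running[r];
                total += squared ? diff * diff : Math.abs(diff);
            }
            result[t] = total / numRows;             // mean loss after t + 1 trees
        }
        return result;
    }
}
```
    A decreasing sequence in the returned array suggests that later iterations are still improving the fit on the given dataset.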
  - write
```
public MLWriter write()
```
    Description copied from interface: MLWritable
    
    Returns an MLWriter instance for this ML instance.
    
    Specified by:
    
    write in interface MLWritable
    
    Returns:
    
    (undocumented)
