CREATE MLMODEL syntax. For details, see CREATE MLMODEL.
Regression Model Common Options
The regression model types have many common options that provide similar functionality. This table describes the purpose of the more common options.| Option | Notes |
|---|---|
metrics | The model supports the metrics option for all regression types. If this option is set to true, then the model also calculates the coefficient of determination (also known as the R^2 value), the root mean squared error (RMSE), and the adjusted R^2 value. If the option is not specified, then the model defaults this option to false. |
threshold | Most regression model types support the threshold option, which enables soft-thresholding. This model type is the same as lasso regression for simple linear regression. However, this model type is not quite the same as lasso regression for other model types. If you specify a non-zero threshold, the model shifts coefficients towards zero by the amount of the threshold. If this shift causes the coefficient to switch from positive to negative or vice versa, the model sets the coefficient to zero. The system performs this coefficient adjustment as a final step before the system saves the model coefficients. Any reported coefficient of determination is based on the model prior to adjustment. The adjustment always causes the coefficient of determination to be smaller. However, this approach is useful as it prevents overfitting. |
gamma | Some regression model types support the gamma option. The value of this option is the Tikhonov square matrix in the form: {{a, b, c, ...}, {d, e, f, ...}, {g, h, i, ...}}. If you do not specify this option, the model defaults the option to the 0 matrix, which equates to multiple linear regression without regularization. You can perform ridge regression by using a gamma value. This is a bit complicated for the polynomial and linear combination regression model types. The system computes these model types by using multiple linear regression after applying the functions to be linearly combined to the input arguments. In these cases, the system applies ridge regression after applying the model functions to the model input values. |
weighted | Some regression model types also support weighted least squares. The model types do not support generalized least squares where the weight matrix can be non-diagonal. To use weighted least squares, add the option weighted and set it to true. Then, add one more column to the result set used to create the model. This new column must appear last, after the dependent variable, and it specifies the weight for that sample. |
Model option names are case-sensitive.
Simple Linear Regression
Model Type:SIMPLE LINEAR REGRESSION
Trains a new machine learning model of type <model_type> on the result set returned by the SQL SELECT statement. After the database creates the model, <model_name> becomes an executable function in SQL SELECT statements.
x). The second column is the dependent variable (referred to as y). The model finds the least squares best fit for y = ax + b.
The simple linear regression model type also supports the yIntercept option, which is the y-intercept of the resulting best-fit line. If you do not specify this option, the model does not force the y-intercept to be any particular value, and the model uses least squares to find the best value instead. If you force the y-intercept to be a particular value, the uses least squares to find the best fit with that constraint.
Model Options
Optional
metrics — If you set this option to true, the model collects quality metrics such as the coefficient of determination (r^2) and the root mean squared error (RMSE). This option defaults to false.
yIntercept — If you set this option, then the option must be a numeric value. The system forces the specific y-intercept (i.e., the model value when x is zero).
threshold — This option enables soft thresholding. If you specify this option, then the option must be a positive numeric value. After the model calculates the coefficients, if any are greater than the threshold value, the threshold value is subtracted from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If false, the database deletes any intermediate tables created during model training. If true, the database prevents the deletion of any intermediate tables created during model training. This option defaults to false.
Execute the Model
Create a simple linear regression model with y-intercept10. Collect metrics for the model execution by setting the metrics option to true.
sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model after training, the model takes a single numeric argument that represents x and returns ax + b.
sys.simple_linear_regression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Multiple Linear Regression
Model Type:MULTIPLE LINEAR REGRESSION
Multiple linear regression means that there is a vector of independent variables and the dependent variable is a scalar-valued function of the vector input, and that it is linear in all vector components.
The model uses the result set as an input. The result set has N columns that must be numeric. The first N - 1 columns are the independent variables (it can be considered a single independent variable that is a vector). The last column is the dependent variable. The model finds the least squares best fit for y = a1 * x1 + a2 * x2 + ... + b, or in vector notation, y = ax + b, where a and x are vectors and the multiplication is a dot product.
Model Options
Optional
metrics — If you set this option to true, the model collects quality metrics such as the coefficient of determination (r^2), the adjusted coefficient of determination, and the root mean squared error (RMSE). This option defaults to false.
threshold — This option enables soft thresholding. If you specify this option, the option must be a positive numeric value. After the model calculates the coefficients, if any coefficients are greater than the threshold value, the model subtracts the threshold value from the coefficients. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to the coefficients. For any coefficients that are between the negative and positive threshold values, the model sets the coefficients to zero.
weighted — If you set this option to true, the model performs weighted least squares regression, where each sample has a weight or importance associated with it. In this case, there is an extra numeric column after the dependent variable that has the weight for the sample. This option defaults to false.
gamma — If you set this option, the option must be a matrix. The value represents a Tikhonov gamma matrix used for regularization. For details, see Tikhonov regularization. The model uses this option for ridge regression.
yIntercept — If you set this option, then the option must be a numeric value. The system forces the specific y-intercept (i.e., the model value when x is zero).
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If false, the database deletes any intermediate tables created during model training. If true, the database prevents the deletion of any intermediate tables created during model training. This option defaults to false.
Execute the Model
Create a multiple linear regression model. Collect metrics for the model execution by setting themetrics option to true.
sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model after training, the independent variables are provided to the model function execution and the function returns the estimate of the dependent variable.
sys.multiple_linear_regression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Vector Autoregression
Model Type:VECTOR AUTOREGRESSION
Vector auto-regression is a model that estimates the next value of multiple variables based on some number of lags of all variables, as a group. The model tries to build the following:
Estimate <x1(t), x2(t)> based on x1(t-1), x2(t-1), x1(t-2), and x2(t-2)
In this example, x1(t) means the value of x1 at time t, and x1(t-1) means the value of x1 at time t-1 (typically the previous sample time). The syntax <x1(t), x2(t)> demonstrates that the result of the model is a row vector that contains all the predictions of the model and that all predictions rely on all lags of all variables.
When you create a model, the input result set must have one more column than the number of lags. Each column must be a row vector of a size equal to the number of variables. The first column is the un-lagged values, e.g., {{x1, x2, x3}}. The second column is the first lag for all variables, e.g., {{x1_lag1, x2_lag2, x3_lag3}}.
Model Options
Required
numVariables — Specify this option as a positive integer for the number of variables in the model.
numLags — Specify this option as a positive integer for the number of lags in the model.
Optional
metrics — If you set this option to true, the function collects the metric for the coefficient of determination (r^2). This option defaults to false.
threshold — This option enables soft thresholding. If you specify this option, then the option must be a positive numeric value. After the model calculates the coefficients, if any of them are greater than the threshold value, the threshold value is subtracted from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If false, the database deletes any intermediate tables created during model training. If true, the database prevents the deletion of any intermediate tables created during model training. This option defaults to false.
Execute the Model
Create a vector autoregression model with three variables and four lags on each one. Collect metrics for the model execution by setting themetrics option to true.
sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model after training, the number of arguments you specify must be equal to the number of lags. Each of those arguments must be a row vector that contains lags for all model variables. The first argument is the first lag. The second argument is the second lag, and so on. In this example, the unlagged value is the first lag, which means that the model must predict the next value.
sys.vector_autoregression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Polynomial Regression
Model Type:POLYNOMIAL REGRESSION
Polynomial regression means that there are one to many independent variables and one dependent variable and that should be modeled in terms of an nth-degree polynomial of the independent variables.
The order option must be set to a positive integer value that specifies the degree of the polynomial to use. If you specify the option negativePowers and set it to true, then the model also includes terms with negative exponents, but still with the restriction that the sum of the absolute value of the exponents in a term will be less than or equal to the value specified on the order option. Regardless of whether you use the negativePowers option, the model computes a coefficient for every possible term that meets this restriction. When you use the negativePowers option, the model contains many more terms. For example, a quadratic model over two independent variables has six terms, but when you use the negativePowers option, the model has 13 terms.
The result set you specify as input to the model has N columns, which must all be numeric. The first N-1 columns are the independent variables (it can be considered a single independent variable that is a vector). The last column is the dependent variable. The model finds the least squares best fit of a sum of all possible combinations of terms where the degree is less than or equal to the value of the order option. For example, with two independent variables (x1 and x2) and order set to 2, the model is y = a1*x1^2 + a2*x2^2 + a3*x1*x2 + a4*x1 + a5*x2 + b.
Model Options
Required
order — This option is the degree of the polynomial and must be set to a positive integer.
Optional
metrics — If you set this option to true, the model collects quality metrics such as the coefficient of determination (r^2), the adjusted coefficient of determination, and the root mean squared error (RMSE). This option defaults to false.
threshold — This option enables soft thresholding. If you specify this option, then the option must be a positive numeric value. After the model calculates the coefficients, if any of them are greater than the threshold value, the threshold value is subtracted from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
weighted — If you set this option to true, the model performs weighted least squares regression, where each sample has an associated weight. When weighted, there is an extra numeric column after the dependent variable that has the weight for the sample. This option defaults to false.
negativePowers — If you set this option to true, the model includes independent variables raised to negative powers. These variables are called Laurent polynomials. The model generates all possible terms such that the sum of the absolute value of the power of each term in each product is less than or equal to the order. For example, with two independent variables and the order set to 2, the model is: y = a1*x1^2 + a2*x1^-2 + a3*x2^2 + a4*x2^-2 + a5*x1*x2 + a6*x1^-1*x2 + a7*x1*x2^-1 + a8*x1^-1*x2^-1 + a9*x1 + a10*x1^-1 + a11*x2 + a12*x2^-1 + b.
gamma — If you specify this option, the value must be a matrix. The value represents a Tikhonov gamma matrix that is used for regularization. For details, see Tikhonov regularization.
yIntercept — If you set this option, then the option must be a numeric value. The system forces the specific y-intercept (i.e., the model value when x is zero).
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If false, the database deletes any intermediate tables created during model training. If true, the database prevents the deletion of any intermediate tables created during model training. This option defaults to false.
Execute the Model
Create a polynomial regression model of degree 3. Collect metrics for the model execution by setting themetrics option to true.
sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model after training, the independent variables are provided to the model function execution, and the function returns the estimate of the dependent variable.
sys.polynomial_regression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Linear Combination Regression
Model Type:LINEAR COMBINATION REGRESSION
Linear combination regression is a model built on top of m independent variables and a single dependent variable. But in this case, the function used to perform least-squares regression is a linear combination of functions that you specify. The general form is y = c0 + c1 * f1(x1, x2, ...) + f2(x1, x2, ...) + ....
The model determines the number of independent variables based on the number of columns in the SQL statement that the model is based on. There is always a column for the dependent variable, and there can be a weight column (see the weighted option). So, the number of independent variables is either one or two less than the number of columns in the result of the input SQL statement. The number of user-specified functions for the model must be specified by defining function1, function2, and so on (keys in the options dictionary). As long as consecutive function key names exist, the model includes these names. The model always includes a constant term. The value strings for the functionN keys must be specified in SQL syntax and should refer to x1, x2, ... for the model input independent variables.
The result set you specify as input to the model has N columns, which must all be numeric. The first N-1 columns are independent variables (it can be considered a single independent variable that is a vector). The last column is the dependent variable. The model finds the least squares best fit for a model of the form y = a1 * f1(x1, x2, ... xn) + a2 * f2(x1, x2, ... xn) + ... + an * fn(x1, x2, ... nx), where f1, f2, ..., fn are functions that are provided in a required option.
Model Options
Required
functionN — You must specify the first function (f1) using a key named 'function1'. Subsequent functions must use keys with names that use subsequent values of N. You must specify functions in SQL syntax and should use the variables x1, x2, ..., xn to refer to the 1st, 2nd, and nth independent variables, respectively. For example,'function1' -> 'sin(x1 * x2 + x3)', 'function2' -> 'cos(x1 * x3)'.
Optional
metrics — If you set this option to true, the model collects quality metrics such as the coefficient of determination (r^2), the adjusted coefficient of determination, and the root mean squared error (RMSE). This option defaults to false.
threshold — This option enables soft thresholding. If you specify this option, the option must be a positive numeric value. After the model calculates the coefficients, if any coefficients are greater than the threshold value, the model subtracts the threshold value from the coefficients. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to the coefficients. For any coefficients between the negative and positive threshold values, the model sets the coefficients to zero.
weighted — If you set this option to true, the model performs weighted least squares regression, where each sample has an associated weight or importance. When weighted, there is an extra numeric column after the dependent variable that represents the weight of the sample. This option defaults to false.
gamma — If you specify this option, it must be a matrix. The value represents a Tikhonov gamma matrix used for regularization. For details, see Tikhonov regularization.
yIntercept — If you set this option, then the option must be a numeric value. The system forces the specific y-intercept (i.e., the model value when x is zero).
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If false, the database deletes any intermediate tables created during model training. If true, the database prevents the deletion of any intermediate tables created during model training. This option defaults to false.
Execute the Model
Create a linear combination regression with two functions that exhibit a wave-like pattern.sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model after training, the independent variables are provided to the model function execution, and the function returns the estimate of the dependent variable.
sys.linear_combination_regression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Nonlinear Regression
Model Type:NONLINEAR REGRESSION
Nonlinear regression finds best-fit parameters of an arbitrary function using an arbitrary loss function. This model type essentially provides direct access to capabilities that both logistic regression and support vector machines rely on. The first N - 1 columns are numeric independent variables, and the last column is the numeric dependent variable. For faster, lower quality models, reduce the popSize, initialIterations, and subsequentIterations options. Conversely, for slower, higher quality models, increase the values for these same options.
Model Options
Required
numParameters — Specify this option as a positive integer. This value specifies the number of different parameters to optimize, i.e., how many different aN variables there are in the user-specified function.
function — Specify the name of the function to fit the data in SQL syntax. Use a1, a2, … to refer to the parameters for optimization. Use x1, x2, … to refer to the input features. The model does not allow some SQL functions. The model allows only scalar expressions that can be represented internally as postfix expressions. Most notably, the model does not allow some functions that are rewritten as CASE statements (like least() and greatest()). If your function is not allowed, the model displays an error message.
Optional
metrics — If you set this option to true, the model calculates the coefficient of determination (r^2), the adjusted r^2, and the root mean squared error (RMSE). However, the model calculates these quality metrics using the least squares loss function, and not the user-specified loss function because these metrics really only make sense for least squares.
lossFunction — If you specify this option, this parameter dictates the nonlinear optimizer the loss function uses on a per-sample basis. Then, the actual loss function is the sum of this function applied to all samples. The model should use the variable y to refer to the dependent variable in the training data and the variable f to refer to the computed estimate for a specified sample. The default is least squares, which could be specified as (f-y)*(f-y).
popSize — If you specify this option, the value must be a positive integer. This option sets the population size for the particle swarm optimization (PSO) part of the algorithm. It defaults to 100.
minInitParamValue — If you specify this option, the value must be a floating-point number. This option sets the minimum for initial parameter values in the optimization algorithm. It defaults to -10.
maxInitParamValue — If you specify this option, the value must be a floating-point number. This option sets the maximum for initial parameter values in the optimization algorithm. It defaults to 10.
initialIterations — If you specify this option, the value must be a positive integer. This option sets the number of PSO iterations for the first PSO pass. This option defaults to 500.
subsequentIterations — If you specify this option, the value must be a positive integer. This value sets the number of subsequent PSO iterations after the initial pass. This option defaults to 100.
momentum — If you specify this option, the value must be a positive floating point number. This parameter controls how much PSO iterations move away from the local best value to explore new territory. This option defaults to 0.1.
gravity — If you specify this option, the value must be a positive floating point number. This parameter controls how much PSO iterations are drawn back towards the local best value. This option defaults to 0.01.
lossFuncNumSamples — If you specify this option, the value must be a positive integer. This parameter controls how many points the model samples when estimating the loss function. This option defaults to 1000.
numGAAttempts — If you specify this option, the value must be a positive integer. This parameter controls how many GA crossover possibilities the model tries. This option defaults to 10 million.
maxLineSearchIterations — If you specify this option, the value must be a positive integer. This parameter controls the maximum allowed number of iterations when the model runs the line search part of the algorithm. This option defaults to 200.
minLineSearchStepSize — If you specify this option, the value must be a positive floating-point number. This parameter controls the minimum step size that the line search algorithm takes. This option defaults to 1e-5.
ridgeCoefficient — If you specify this option, the value must be a double data type. This option is the ridge coefficient for the loss function. The default behavior is the function ignores this option, effectively setting this option to 0.0.
lassoCoefficient — If you specify this option, the value must be a double data type. This option is the lasso coefficient for the loss function. The default behavior is the function ignores this option, effectively setting this option to 0.0.
samplesPerThread — If you specify this option, the value must be a positive integer. This parameter controls the target number of samples sent to each thread. Each thread independently computes a logistic regression model, and the models are all combined at the end. The option defaults to 1 million.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If false, the database deletes any intermediate tables created during model training. If true, the database prevents the deletion of any intermediate tables created during model training. This option defaults to false.
Execute the Model
Create a nonlinear regression that fits five parametersa1, a2, a3, a4, a5 with two independent variables x1, x2.
sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When the model executes, pass N - 1 independent variables, and the model returns the estimate of the dependent variable.
sys.nonlinear_regression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in the System Catalog.

