 Predictive Least Squares - Maple Help

Statistics

 PredictiveLeastSquares
 fit a linear model to data

Calling Sequence

 PredictiveLeastSquares(A, B, v, options)

Parameters

 A - Matrix; values of independent variables
 B - Vector; values of dependent variable
 v - name; (optional) independent variable name
 options - (optional) equation(s) of the form option=value where option is one of samplesize, tolerance or numtrials

Description

 • The PredictiveLeastSquares command returns a list, P, and a Vector, V, chosen so that A[..,P] . V is approximately equal to B. The selection is based on random trials that use a subset of the data for fitting and the remaining data to test the goodness of the fit.
 • This command works best in situations where the problem is underspecified; that is, the number of observations (rows of A and B) is of the same order of magnitude as, or less than, the number of variables (columns of A).  The returned list, P, contains the column indices of the variables that testing found to be most relevant.  Restricting the model to these columns reduces the effect of outliers and overfitting when the model is used to predict new results.
 • A and B must contain numeric entries.  A is an m x n Matrix, and B is an m x 1 Vector.

Options

 The options argument can contain one or more of the options shown below.
 • numtrials = integer -- Specify how many random subsamples are used to determine which variables to drop at each phase of regression.  After each sweep that removes one or more variables (columns), another phase consisting of the specified number of trials is performed.  The default is numtrials=15.
 • tolerance = realcons(nonnegative) -- Set the tolerance that determines whether a fit coefficient is considered insignificant and should therefore be removed. This is a relative tolerance, measured against the largest coefficient. The default is 1e-10.
 • samplesize = realcons(nonnegative) -- Specify the fraction of the data that is used for building the model. For example, with samplesize=.7, random subsamples containing 70% of the data are used for fitting, and the remaining 30% is used for testing. The value must be a number between 0 and 1.  The default is .55.

Notes

 • The underlying computation is done in floating-point; therefore, all data points must have type realcons and all returned solutions are floating-point, even if the problem is specified with exact values.  For more information about numeric computation in the Statistics package, see the Statistics/Computation help page.

Examples

 > $\mathrm{with}\left(\mathrm{Statistics}\right):$

In this first example, we have a matrix, A, with 100 columns of data, but the data in B depends only on the first 4 of those columns.

 > $A≔\mathrm{LinearAlgebra}:-\mathrm{RandomMatrix}\left(100,100,\mathrm{datatype}=\mathrm{float}\left[8\right]\right):$
 > $B≔\mathrm{Vector}\left(100,i↦A\left[i,1\right]+0.1\cdot A\left[i,2\right]+0.5\cdot A\left[i,3\right]-0.3\cdot A\left[i,4\right]\right):$

The computed list, p, shows that only the first 4 columns are relevant, and the coefficient vector, LSP, matches the coefficients used to build B to within floating-point rounding. All columns not referenced by p can be discarded.

 > $p,\mathrm{LSP}≔\mathrm{PredictiveLeastSquares}\left(A,B\right)$
 ${p}{,}{\mathrm{LSP}}{≔}\left[{1}{,}{2}{,}{3}{,}{4}\right]{,}\left[\begin{array}{c}{1.00000000000000}\\ {0.0999999999999998}\\ {0.500000000000000}\\ {-0.300000000000000}\end{array}\right]$ (1)
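
The options listed in the Options section are passed as extra arguments to the same call. The following is a minimal sketch that rebuilds the data from the first example; the names p2 and LSP2 and the particular option values are assumptions chosen only for illustration, not recommended settings.

```maple
with(Statistics):

# Same construction as the first example: B depends only on columns 1..4 of A.
A := LinearAlgebra:-RandomMatrix(100, 100, datatype = float[8]):
B := Vector(100, i -> A[i, 1] + 0.1*A[i, 2] + 0.5*A[i, 3] - 0.3*A[i, 4]):

# Illustrative (assumed) settings: fit on 70% of the rows in each trial,
# run 30 trials per phase, and treat coefficients smaller than 1e-8
# relative to the largest coefficient as insignificant.
p2, LSP2 := PredictiveLeastSquares(A, B, samplesize = 0.7, numtrials = 30, tolerance = 1e-8);
```

Because the subsamples are chosen at random, the result may vary from run to run, so no output is shown here.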

In this second example, we create a result vector that depends on 10 variables, only 5 of which are measured in the matrix, A (along with 95 other measurements of irrelevant/random properties).

 > $\mathrm{numsamples}≔50:$
 > $\mathrm{numvariables}≔100:$
 > $Z≔\mathrm{LinearAlgebra}:-\mathrm{RandomMatrix}\left(\mathrm{numsamples},10,\mathrm{datatype}=\mathrm{float}\left[8\right]\right):$
 > $A≔\mathrm{LinearAlgebra}:-\mathrm{RandomMatrix}\left(\mathrm{numsamples},\mathrm{numvariables},\mathrm{datatype}=\mathrm{float}\left[8\right]\right):$
 > $A\left[..,1..5\right]≔Z\left[..,1..5\right]:$
 > $B≔\mathrm{Vector}\left(\mathrm{numsamples},i↦\mathrm{add}\left(\frac{Z\left[i,j\right]}{j},j=1..10\right)\right):$
 > $p,\mathrm{LSP}≔\mathrm{PredictiveLeastSquares}\left(A,B\right):$

The notation A[..,p] selects all the rows of A and only the columns whose indices appear in the list p.  This is the reduced matrix.  Note the correlation of B and A[..,p] . LSP.

 > $\mathrm{Correlation}\left(B,A\left[..,p\right]·\mathrm{LSP}\right)$
 ${0.992800745975950}$ (2)
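
The column selection used to form the reduced matrix can be seen in isolation on a small matrix. This is a minimal sketch; the names M and cols are hypothetical and are not part of the example data.

```maple
# A small 2 x 3 Matrix used only to illustrate the indexing.
M := Matrix([[1, 2, 3], [4, 5, 6]]):

# A list of column indices, playing the role of the list p.
cols := [1, 3]:

# All rows of M, but only columns 1 and 3: a 2 x 2 Matrix.
M[.., cols];
```

Indexing A with the list p works the same way, so A[..,p] has as many columns as p has entries, matching the length of LSP.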

Compare the correlation in (2) with that of the standard least squares fit.

 > $\mathrm{LS}≔\mathrm{LinearAlgebra}:-\mathrm{LeastSquares}\left(A,B\right):$
 > $\mathrm{Correlation}\left(B,A·\mathrm{LS}\right)$
 ${1.00000000000000}$ (3)

The correlation with the training data is a closer match using standard least squares, but let's see what happens when we use these models to predict results using new data.

 > $\mathrm{Z2}≔\mathrm{LinearAlgebra}:-\mathrm{RandomMatrix}\left(\mathrm{numsamples},10,\mathrm{datatype}=\mathrm{float}\left[8\right]\right):$
 > $\mathrm{A2}≔\mathrm{LinearAlgebra}:-\mathrm{RandomMatrix}\left(\mathrm{numsamples},\mathrm{numvariables},\mathrm{datatype}=\mathrm{float}\left[8\right]\right):$
 > $\mathrm{A2}\left[..,1..5\right]≔\mathrm{Z2}\left[..,1..5\right]:$
 > $\mathrm{GuessLS}≔\mathrm{A2}·\mathrm{LS}:$
 > $\mathrm{GuessLSP}≔\mathrm{A2}\left[..,p\right]·\mathrm{LSP}:$
 > $\mathrm{Actual}≔\mathrm{Vector}\left(\mathrm{numsamples},i↦\mathrm{add}\left(\frac{\mathrm{Z2}\left[i,j\right]}{j},j=1..10\right)\right):$

Note that the new data correlates much better with the predictions of the predictive model than with those of the standard model, which suffers from overfitting.

 > $\mathrm{Correlation}\left(\mathrm{GuessLS},\mathrm{Actual}\right)$
 ${0.727378412013161}$ (4)
 > $\mathrm{Correlation}\left(\mathrm{GuessLSP},\mathrm{Actual}\right)$
 ${0.959754787645569}$ (5)

Compatibility

 • The Statistics[PredictiveLeastSquares] command was introduced in Maple 17.