Statistics[SplitByColumn] - split matrix data into submatrices by values of one column
|
Calling Sequence
|
|
SplitByColumn(M, n, options)
SplitByColumn(L, n, options)
|
|
Parameters
|
|
M
|
-
|
Matrix of data
|
n
|
-
|
posint; index of the 'key' column
|
L
|
-
|
list of Vectors or other ordered 1-dimensional data sets; all data sets should have the same number of elements
|
options
|
-
|
(optional) equation(s) of the form option=value where option is one of bounds, output, or ignore; specify options for the SplitByColumn function
|
|
|
|
|
Description
|
|
•
|
The SplitByColumn function splits a Matrix of data into a list of submatrices. Data placement into a submatrix is decided on a row by row basis, depending on the value in a particular column (the 'key' column).
|
|
This is useful if the values in any one row are related, and data in any one column are of the same type. This type of data is sometimes called a Matrix data set. For example, data about house sales in a region may be organized in three columns giving the price, number of bedrooms, and surface area of each house, where the three values belonging to any one sale are in the same row. This function allows you to separate your data into submatrices with all data for sales involving one bedroom houses, two bedroom houses, etc. Many other Statistics commands can then find properties for these submatrices individually. For more details and a list of applicable Statistics commands, see Matrix data set.
|
•
|
Alternatively, if your data is organized into a number of Vectors for each of the types of data (corresponding to the columns in the Matrix case), you can also submit it as a list of Vectors. In this case, n is the index of the 'key' Vector in the list L of Vectors.
|
|
|
Options
|
|
|
The options argument can contain one or more of the options shown below.
|
•
|
bounds - By default, SplitByColumn creates a separate submatrix for every different value in the column of M. This may not be what you want, especially if your key data is more or less continuous (such as the house prices in the example above). In this case, you can use the bounds option to specify boundaries of 'bins' into which values are collected. There are several different ways to use this option:
|
•
|
is the default and leads to a separate 'bin' for every value in the key column.
|
|
These intervals are closed on the left and open on the right; that is, if the exact value occurs in the data, then that row goes into the second matrix, not the first one. The rightmost bin is an exception: it represents an interval closed on both sides.
|
|
Instead of using a list , you can also use a Vector with the same entries. This has the same effect.
|
|
For example, specifying specifies three bins, the first of which contains the lowest 25% of data, then then next 60% - 25% = 35% of data, and finally the following 90% - 60% = 30% of data. That is, the bounds are at the 0th, 25th, 60th, and 90th Percentile of the data in the key column. This could also be specified as using the notion of Quantile instead of Percentile, or using Deciles, or using Quartiles.
|
•
|
output = default or vectorlist or matrix can be used to override the output format for each submatrix: a Matrix or a list of column Vectors. The default is to return the same type of data as was submitted. That is, for the calling sequence involving M, SplitByColumn returns a list of matrices by default, and for the calling sequence involving L, it returns a list of lists of Vectors. This can be overridden with this option.
|
|
In either case, the internal computations occur with a Matrix, so input and/or output of a list of Vectors costs some time and memory for conversion.
|
•
|
ignore = true or false specifies how undefined values in the key column should be treated. With ignore = true, all undefined values (and the corresponding rows) will be removed from the data set. With ignore = false (the default), the behavior of SplitByColumn depends on the value of bounds: with bounds = none, any occurring undefined values in the key column will give rise to a submatrix with undefineds. (If multiple different values of type,undefined occur, then every value gets its own submatrix.) With all other settings of bounds, undefined values will end up in an arbitrary submatrix.
|
|
|
Compatibility
|
|
•
|
The Statistics[SplitByColumn] command was introduced in Maple 16.
|
|
|
Examples
|
|
>
|
|
We construct a Matrix with housing data. The first column has number of bedrooms, the second has number of square feet, the third has price.
>
|
|
| (1) |
We can create box plots of the price for subgroups of sales defined by number of bedrooms.
>
|
|
| (2) |
>
|
|
>
|
|
>
|
|
Looking at this plot, we see that there is one house that is especially small and cheap, and two that are especially large and expensive, and the rest are clustered fairly closely together. Let's split that cluster into two subgroups according to their area. There is a small gap around 950.
>
|
|
| (3) |
>
|
|
| (4) |
>
|
|
We can compute the means of the columns in an individual data set by using the Matrix data set calling sequence of Mean. To apply it to the list of Matrix data sets returned by SplitByColumn, we can use the elementwise version of Mean, obtained by appending a tilde.
>
|
|
| (5) |
>
|
|
| (6) |
We can also split the data into four roughly equally large groups by price. The output is lists of vectors this time.
>
|
|
| (7) |
Let's examine the areas of these houses.
>
|
|
| (8) |
>
|
|
There are undefined values in the following Matrix. If we do not supply a bounds argument, then the undefineds in the third column are grouped into a separate bin.
>
|
|
| (9) |
>
|
|
| (10) |
If we do supply a bounds argument, then the result is not well-defined. In this case, the undefineds are placed in the first half of data.
>
|
|
| (11) |
With the ignore option, the undefined values are removed from the key column. Note that the undefined value in the second column remains untouched.
>
|
|
| (12) |
|
|