split matrix data into submatrices by values of one column - Maple Help

Online Help

All Products    Maple    MapleSim

Home : Support : Online Help : Statistics : Statistics Package : Data Manipulation : Statistics/SplitByColumn

Statistics[SplitByColumn] - split matrix data into submatrices by values of one column

Calling Sequence

SplitByColumn(M, n, options)

SplitByColumn(L, n, options)




Matrix of data



posint; index of the 'key' column



list of Vectors or other ordered 1-dimensional data sets; all data sets should have the same number of elements



(optional) equation(s) of the form option=value where option is one of bounds, output, or ignore; specify options for the SplitByColumn function



The SplitByColumn function splits a Matrix of data into a list of submatrices. Data placement into a submatrix is decided on a row by row basis, depending on the value in a particular column (the 'key' column).


This is useful if the values in any one row are related, and data in any one column are of the same type. This type of data is sometimes called a Matrix data set. For example, data about house sales in a region may be organized in three columns giving the price, number of bedrooms, and surface area of each house, where the three values belonging to any one sale are in the same row. This function allows you to separate your data into submatrices with all data for sales involving one bedroom houses, two bedroom houses, etc. Many other Statistics commands can then find properties for these submatrices individually. For more details and a list of applicable Statistics commands, see Matrix data set.


Alternatively, if your data is organized into a number of Vectors for each of the types of data (corresponding to the columns in the Matrix case), you can also submit it as a list of Vectors. In this case, n is the index of the 'key' Vector in the list L of Vectors.



The options argument can contain one or more of the options shown below.


bounds - By default, SplitByColumn creates a separate submatrix for every different value in the nth column of M. This may not be what you want, especially if your key data is more or less continuous (such as the house prices in the example above). In this case, you can use the bounds option to specify boundaries of 'bins' into which values are collected. There are several different ways to use this option:


bounds=none is the default and leads to a separate 'bin' for every value in the key column.


bounds=b0,b1,...,bn means values from the key column will be divided into n bins: from b0 to b1, from b1 to b2, etc. Values less than b0 or greater than bn, and the rows containing them, are discarded.


These intervals are closed on the left and open on the right; that is, if the exact value b1 occurs in the data, then that row goes into the second matrix, not the first one. The rightmost bin is an exception: it represents an interval closed on both sides.


Instead of using a list b0,b1,...,bn, you can also use a Vector with the same entries. This has the same effect.


bounds=quantilesq0,q1,...,qn or bounds=decilesd0,d1,...,dn or bounds=percentilesp0,p1,...,pn or bounds=quartilesq0,q1,...,qn are equivalent ways of specifying bins similar to specifying a list, but instead of specifying absolute numbers for the bounds, you specify which fraction of the data should fall in each of the bins.


For example, specifying bounds=percentiles0,25,60,90 specifies three bins, the first of which contains the lowest 25% of data, then then next 60% - 25% = 35% of data, and finally the following 90% - 60% = 30% of data. That is, the bounds are at the 0th, 25th, 60th, and 90th Percentile of the data in the key column. This could also be specified as bounds=quantiles0,0.25,0.6,0.9 using the notion of Quantile instead of Percentile, or bounds=deciles0,2.5,6,9 using Deciles, or bounds=quartile0,1,2.4,3.6 using Quartiles.


output = default or vectorlist or matrix can be used to override the output format for each submatrix: a Matrix or a list of column Vectors. The default is to return the same type of data as was submitted. That is, for the calling sequence involving M, SplitByColumn returns a list of matrices by default, and for the calling sequence involving L, it returns a list of lists of Vectors. This can be overridden with this option.


In either case, the internal computations occur with a Matrix, so input and/or output of a list of Vectors costs some time and memory for conversion.


ignore = true or false specifies how undefined values in the key column should be treated. With ignore = true, all undefined values (and the corresponding rows) will be removed from the data set.  With ignore = false (the default), the behavior of SplitByColumn depends on the value of bounds: with bounds = none, any occurring undefined values in the key column will give rise to a submatrix with undefined values. If multiple different values of type,undefined occur, then every value gets its own submatrix. With all other settings of bounds, undefined values will end up in an arbitrary submatrix.



We construct a Matrix with housing data. The first column has number of bedrooms, the second has number of square feet, the third has price.


HouseSalesData:= 15 x 3 MatrixData Type: anythingStorage: rectangularOrder: Fortran_order


We can create box plots of the price for subgroups of sales defined by number of bedrooms.





BoxPlotPricesByRooms,'deciles=false','datasetlabels'=2 bdrms,3 bdrms,4 bdrms,'color'=Red,Purple,Blue


Looking at this plot, we see that there is one house that is especially small and cheap, and two that are especially large and expensive, and the rest are clustered fairly closely together. Let's split that cluster into two subgroups according to their area. There is a small gap around 950.








We can compute the means of the columns in an individual data set by using the Matrix data set calling sequence of Mean. To apply it to the list of Matrix data sets returned by SplitByColumn, we can use the elementwise version of Mean, obtained by appending a tilde.







We can also split the data into four roughly equally large groups by price. The output is lists of vectors this time.




Let's examine the areas of these houses.




BoxPlotAreasByPrice,'deciles=false','datasetlabels'=1st quartile,2nd quartile,3rd quartile,4th quartile,'color'=Red..Yellow

There are undefined values in the following Matrix. If we do not supply a bounds argument, then the undefined values in the third column are grouped into a separate bin.







If we do supply a bounds argument, then the result is not well-defined. In this case, the row with value 9 in the key column disappears, and undefined values are placed in the first half of data.




With the ignore option, the undefined values are removed from the key column. Note that the undefined value in the second column remains untouched.




See Also

Statistics, Statistics[Computation], Statistics[Sort], Statistics[Tally], Statistics[TallyInto]

Download Help Document

Was this information helpful?

Please add your Comment (Optional)
E-mail Address (Optional)
What is ? This question helps us to combat spam