DataFrames in Statistics

Description

•	This help page describes how to use Statistics commands on DataFrame objects, and other spreadsheet-type data in matrices, sometimes called Matrix data sets.

•	Many of the data sets you might encounter are two-dimensional in nature. They will have information about a number of items or events; for each item or event, the same properties are known. Such data sets can easily be represented in a DataFrame by having each row correspond to an item and each column to one property of all these items. This is how you would typically store such data in a spreadsheet. You can also store such data in a Matrix, as long as you keep track of labels for the rows and columns yourself.

•	Many commands in the Statistics package can be used with this type of data:

–	The following computational commands can be run on DataFrame objects or on Matrices. For a DataFrame, each command is applied to every column and the results are returned in a DataSeries object whose labels are the column labels of the DataFrame. For a Matrix, each command is applied to every column and the results are returned in a row Vector.

AbsoluteDeviation	CentralMoment	Count	CountMissing
Cumulant	DataSummary	Decile	Detrend
Difference	ExpectedValue	FivePointSummary	GeometricMean
HarmonicMean	HodgesLehmann	InterquartileRange	Kurtosis
Mean	MeanDeviation	Median	MedianDeviation
Mode	Moment	Percentile	QuadraticMean
Quantile	Quartile	Range	RousseeuwCrouxQn
RousseeuwCrouxSn	Scale	Skewness	StandardDeviation
StandardError	StandardizedMoment	TrimmedMean	Variance
Variation	WinsorizedMean

–	The following visualization commands, listed on the Statistics Visualization help page, also accept DataFrame objects. Generally, the row and column labels are used to label data points and data sets, respectively, as appropriate.

AgglomeratedPlot	AreaChart	BarChart	Biplot
BoxPlot	BubblePlot	ColumnGraph	CumulativeSumChart
ErrorPlot	FrequencyPlot	GridPlot	LineChart
ParetoChart	PointPlot	ScatterPlot	ScreePlot
ViolinPlot

–	Statistics[SplitByColumn] and Statistics[Join] split Matrices into submatrices and join them back together.

–	DataFrame/Aggregate does similar things for DataFrame objects.

•	Additional examples are found in the Statistics with DataFrames example worksheet.

Examples

>	$with (Statistics) :$

We construct a DataFrame with housing data. The first column holds the number of bedrooms, the second the area in square feet, and the third the price.

>	$bedrooms ≔ ⟨3, 4, 2, 4, 3, 2, 2, 3, 4, 4, 2, 4, 4, 3, 3⟩$

$bedrooms ≔ \begin{array}{c} [\begin{array}{c} 3 \\ 4 \\ 2 \\ 4 \\ 3 \\ 2 \\ 2 \\ 3 \\ 4 \\ 4 \\ ⋮ \end{array}] \\ 15 element Vector[column] \end{array}$

(1)

>	$area ≔ ⟨1130, 1123, 1049, 1527, 907, 580, 878, 1075, 1040, 1295, 1100, 995, 908, 853, 856⟩$

$area ≔ \begin{array}{c} [\begin{array}{c} 1130 \\ 1123 \\ 1049 \\ 1527 \\ 907 \\ 580 \\ 878 \\ 1075 \\ 1040 \\ 1295 \\ ⋮ \end{array}] \\ 15 element Vector[column] \end{array}$

(2)

>	$price ≔ ⟨114700, 125200, 81600, 127400, 88500, 59500, 96500, 113300, 104400, 136600, 80100, 128000, 115700, 94700, 89400⟩$

$price ≔ \begin{array}{c} [\begin{array}{c} 114700 \\ 125200 \\ 81600 \\ 127400 \\ 88500 \\ 59500 \\ 96500 \\ 113300 \\ 104400 \\ 136600 \\ ⋮ \end{array}] \\ 15 element Vector[column] \end{array}$

(3)

>	$HouseSalesData ≔ DataFrame ([bedrooms, area, price], columns = [Bedrooms, Area, Price])$

$HouseSalesData ≔ \begin{array}{c} [\begin{array}{cccc} & Bedrooms & Area & Price \\ 1 & 3 & 1130 & 114700 \\ 2 & 4 & 1123 & 125200 \\ 3 & 2 & 1049 & 81600 \\ 4 & 4 & 1527 & 127400 \\ 5 & 3 & 907 & 88500 \\ 6 & 2 & 580 & 59500 \\ 7 & 2 & 878 & 96500 \\ 8 & 3 & 1075 & 113300 \\ 9 & 4 & 1040 & 104400 \\ 10 & 4 & 1295 & 136600 \\ ⋮ & ⋮ & ⋮ & ⋮ \end{array}] \\ 15 × 3 DataFrame \end{array}$

(4)

We can determine the average number of bedrooms, average area, and average price with just the Mean command.

>	$Mean (HouseSalesData)$

$[\begin{array}{c} Bedrooms & 3.13333333333333 \\ Area & 1021.06666666667 \\ Price & 103706.666666667 \end{array}]$

(5)

We can also determine the standard error for this mean.

>	$StandardError (Mean, HouseSalesData)$

$[\begin{array}{c} Bedrooms & 0.215288658199187 \\ Area & 56.0832261373064 \\ Price & 5615.39946140175 \end{array}]$

(6)
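
As a check on the Bedrooms entry, the standard error of the mean can be worked out by hand. The sketch below assumes the usual definition, the sample standard deviation $s$ (divisor $n - 1$) divided by $\sqrt{n}$; the symbols $x_i$, $\bar{x}$, $s$, and $n$ are notation introduced here, not taken from this page.

```latex
\begin{aligned}
n &= 15, \qquad \sum_i x_i = 47, \qquad \sum_i x_i^2 = 157, \qquad \bar{x} = \tfrac{47}{15} \approx 3.1333 \\
s^2 &= \frac{1}{n-1}\left(\sum_i x_i^2 - n\,\bar{x}^2\right)
     = \frac{1}{14}\left(157 - \frac{47^2}{15}\right) = \frac{73}{105} \approx 0.69524 \\
\mathrm{SE}(\bar{x}) &= \frac{s}{\sqrt{n}} = \sqrt{\frac{73}{105 \cdot 15}} \approx 0.21529
\end{aligned}
```

This agrees with the Bedrooms value reported in (6).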

Similarly, we can compute the 30th percentile of each column.

>	$Percentile (HouseSalesData, 30)$

$[\begin{array}{c} Bedrooms & 2.93333333333333 \\ Area & 905.066666666667 \\ Price & 89340. \end{array}]$

(7)
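
The values in (7) are interpolated between neighboring order statistics rather than read off as single data points. The worked example below is a sketch: the rule $h = (n + \tfrac{1}{3})\,p + \tfrac{1}{3}$ is an assumption inferred from the output above (it reproduces all three entries), not something this page states, and $x_{(k)}$ denotes the $k$-th smallest value in a column.

```latex
\begin{aligned}
h &= \left(n + \tfrac{1}{3}\right)p + \tfrac{1}{3}
   = \left(15 + \tfrac{1}{3}\right)(0.3) + \tfrac{1}{3} = 4.9\overline{3} \\
Q &= x_{(4)} + (h - 4)\left(x_{(5)} - x_{(4)}\right) \\
\text{Bedrooms:}\quad Q &= 2 + 0.9\overline{3}\,(3 - 2) \approx 2.9333 \\
\text{Area:}\quad Q &= 878 + 0.9\overline{3}\,(907 - 878) \approx 905.0667 \\
\text{Price:}\quad Q &= 88500 + 0.9\overline{3}\,(89400 - 88500) = 89340
\end{aligned}
```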

The GridPlot command can display scatter plots of pairs of columns.

>	$GridPlot (HouseSalesData)$

[Plot grid: pairwise scatter plots of the columns, with diagonal cells labeled Bedrooms, Area, and Price.]

(8)

We can use the lower diagonal entries to display the values for the correlation.

>	$GridPlot (HouseSalesData, lower = Correlation)$

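[Plot grid: the diagonal cells are labeled Bedrooms, Area, and Price; the lower-diagonal cells show the pairwise correlations listed below.]

(9)

	Bedrooms	Area
Area	0.488597380223581068
Price	0.835789649694966386	0.704398937794964319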
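
The correlation values in the grid can also be computed one pair at a time. The following is a minimal sketch that assumes the Statistics[Correlation] command, which this page does not mention, applied to the same three Vectors used to build HouseSalesData. The results should agree with the lower-diagonal entries above.

```maple
with(Statistics):

# The same three columns used to build HouseSalesData
bedrooms := <3, 4, 2, 4, 3, 2, 2, 3, 4, 4, 2, 4, 4, 3, 3>:
area := <1130, 1123, 1049, 1527, 907, 580, 878, 1075, 1040, 1295, 1100, 995, 908, 853, 856>:
price := <114700, 125200, 81600, 127400, 88500, 59500, 96500, 113300, 104400, 136600, 80100, 128000, 115700, 94700, 89400>:

# Pairwise sample correlations
Correlation(bedrooms, area);
Correlation(bedrooms, price);
Correlation(area, price);
```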

We can determine the average area and price for subgroups of sales defined by number of bedrooms. (The Aggregate command is part of the DataFrame object, not the Statistics package, so it is not available for Matrices.)

>	$Aggregate (HouseSalesData, Bedrooms)$

$[\begin{array}{cccc} & Bedrooms & Area & Price \\ 1 & 2 & 901.750000000000 & 79425. \\ 2 & 3 & 964.200000000000 & 100120. \\ 3 & 4 & 1148. & 122883.333333333 \end{array}]$

(10)
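
Each row of this result is consistent with a column-wise mean over the sales that share a number of bedrooms. As a worked check for the two-bedroom group (rows 3, 6, 7, and 11 of HouseSalesData), using subscript notation introduced here for illustration:

```latex
\begin{aligned}
\overline{\text{Area}}_{2} &= \frac{1049 + 580 + 878 + 1100}{4} = \frac{3607}{4} = 901.75 \\
\overline{\text{Price}}_{2} &= \frac{81600 + 59500 + 96500 + 80100}{4} = \frac{317700}{4} = 79425
\end{aligned}
```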

Creating a box plot of prices for each number of bedrooms requires a little more effort.

>	$split ≔ SplitByColumn (HouseSalesData, Bedrooms)$

$split ≔ [[\begin{array}{cccc} & Bedrooms & Area & Price \\ 3 & 2 & 1049 & 81600 \\ 6 & 2 & 580 & 59500 \\ 7 & 2 & 878 & 96500 \\ 11 & 2 & 1100 & 80100 \end{array}], [\begin{array}{cccc} & Bedrooms & Area & Price \\ 1 & 3 & 1130 & 114700 \\ 5 & 3 & 907 & 88500 \\ 8 & 3 & 1075 & 113300 \\ 14 & 3 & 853 & 94700 \\ 15 & 3 & 856 & 89400 \end{array}], [\begin{array}{cccc} & Bedrooms & Area & Price \\ 2 & 4 & 1123 & 125200 \\ 4 & 4 & 1527 & 127400 \\ 9 & 4 & 1040 & 104400 \\ 10 & 4 & 1295 & 136600 \\ 12 & 4 & 995 & 128000 \\ 13 & 4 & 908 & 115700 \end{array}]]$

(11)

>	$price_split ≔ map (df \mapsto convert (df [Price], Vector), split)$

$price_split ≔ [[\begin{array}{c} 81600 & 59500 & 96500 & 80100 \end{array}], [\begin{array}{c} 114700 & 88500 & 113300 & 94700 & 89400 \end{array}], [\begin{array}{c} 125200 & 127400 & 104400 & 136600 & 128000 & 115700 \end{array}]]$

(12)

>	$BoxPlot (price_split, datasetlabels = [2, 3, 4])$

Most of the things mentioned above can be done with a Matrix, too. Consider the following examples.

>	$HSD_Matrix ≔ convert (HouseSalesData, Matrix)$

$HSD_Matrix ≔ \begin{array}{c} [\begin{array}{c} 3 & 1130 & 114700 \\ 4 & 1123 & 125200 \\ 2 & 1049 & 81600 \\ 4 & 1527 & 127400 \\ 3 & 907 & 88500 \\ 2 & 580 & 59500 \\ 2 & 878 & 96500 \\ 3 & 1075 & 113300 \\ 4 & 1040 & 104400 \\ 4 & 1295 & 136600 \\ ⋮ & ⋮ & ⋮ \end{array}] \\ 15 × 3 Matrix \end{array}$

(13)

>	$Mean (HSD_Matrix)$

$[\begin{array}{c} 3.13333333333333 & 1021.06666666667 & 103706.666666667 \end{array}]$

(14)

>	$Percentile (HSD_Matrix, 30)$

$[\begin{array}{c} 2.93333333333333 & 905.066666666667 & 89340. \end{array}]$

(15)
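
Because the Matrix results are plain row Vectors, the column labels have to be carried along separately. The following is a minimal sketch of one way to do that; the names M, labels, and m are hypothetical, and the small Matrix holds only the first three rows of the housing data to keep the example short.

```maple
with(Statistics):

# First three rows of the housing data: bedrooms, area, price
M := Matrix([[3, 1130, 114700], [4, 1123, 125200], [2, 1049, 81600]]):

# Column labels that a DataFrame would otherwise store for us
labels := ["Bedrooms", "Area", "Price"]:

# Column means, returned as a row Vector with one entry per column
m := Mean(M):

# Pair each label with the corresponding column mean
[seq(labels[j] = m[j], j = 1 .. 3)];
```

With a DataFrame this bookkeeping is unnecessary, since the returned DataSeries already carries the column labels.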

Some commands have calling sequences where one of the arguments is compared to the data; this is the case for the second argument of AbsoluteDeviation and for the origin parameter of Moment. In these cases, it typically does not make much sense to use the same value for each column, so Maple supports using a list or Vector of values instead. These commands do not yet work directly with DataFrame objects.

>	$AbsoluteDeviation (HSD_Matrix, [3, 1000, 100000])$

$[\begin{array}{c} 0.666666666666667 & 157.466666666667 & 18333.3333333333 \end{array}]$

(16)
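
The first entry of (16) can be checked by hand. The sketch assumes that AbsoluteDeviation with a second argument $c$ returns the average absolute distance of the data from $c$, which is consistent with the output above; the symbols $x_i$, $c$, and $n$ are notation introduced here. The Bedrooms column contains four 2s, five 3s, and six 4s, and $c = 3$.

```latex
\frac{1}{n}\sum_{i=1}^{n} \lvert x_i - c \rvert
  = \frac{4\,\lvert 2 - 3 \rvert + 5\,\lvert 3 - 3 \rvert + 6\,\lvert 4 - 3 \rvert}{15}
  = \frac{10}{15} \approx 0.6667
```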

>	$StandardError (Moment, HSD_Matrix, 1, origin = [3, 1000, 100000])$

$[\begin{array}{c} 0.207988603676402 & 54.1815439398298 & 5424.99127836814 \end{array}]$

(17)
