Subsets of DataFrames - Maple Programming Help

Online Help

All Products    Maple    MapleSim


Home : Support : Online Help : Applications and Example Worksheets : Statistics and Data Analysis : examples/DataFrame/Subsets

Subsets of DataFrames

Subsetting is an important component of data manipulation. The DataFrame has a powerful indexable structure that makes it possible to access subsets of data that meet given criteria.

The following example worksheet gives several examples for subsetting data frames.

restart;

interface(rtablesize = 15):

Removing Variables or Observations by Indexing

A common operation when subsetting data frames is simply to remove certain rows or columns.

To begin, we load the canada_crimes data set. This data set features six variables and 13 rows of observations on aggregated crime statistics per 100,000 people collected in 2014.

data := Import( FileTools:-JoinPath( [ kernelopts(datadir), "datasets", "canada_crimes.csv" ] ) );

data:=Violent CrimeProperty CrimeOther Criminal CodeCriminal Code TrafficFederal StatuteNewfoundland and Labrador1276.153317.031010.67348.97267.94Prince Edward Island824.433294.3572.18348.64215.34Nova Scotia1241.053307.85902.76368.42375.11New Brunswick1164.322611.17712.02298.71283.45Quebec940.522100.84450.29511.18314.74Ontario786.622292.66476.48211.57258.15Manitoba1712.974311.481689.72276.28362.78Saskatchewan1963.465627.552913.78886.34692.9..................

(1)

The variables for this DataFrame are:

ColumnLabels( data );

Violent Crime,Property Crime,Other Criminal Code,Criminal Code Traffic,Federal Statute

(2)

In order to return subsets of the DataFrame, we can simply index into the DataFrame. For example, to return the subset of data contained in the "Violent Crime" DataSeries:

data[ `Violent Crime` ];

Newfoundland and Labrador1276.15Prince Edward Island824.43Nova Scotia1241.05New Brunswick1164.32Quebec940.52Ontario786.62Manitoba1712.97......

(3)

It is also possible to index the columns of the DataFrame using the integer index values. The following returns the second and third columns:

data[ 2..3 ];

Property CrimeOther Criminal CodeNewfoundland and Labrador3317.031010.67Prince Edward Island3294.3572.18Nova Scotia3307.85902.76New Brunswick2611.17712.02Quebec2100.84450.29Ontario2292.66476.48Manitoba4311.481689.72Saskatchewan5627.552913.78.........

(4)

The following returns the second, third, and fifth columns:

data[ [ 2, 3, 5 ] ];

Property CrimeOther Criminal CodeFederal StatuteNewfoundland and Labrador3317.031010.67267.94Prince Edward Island3294.3572.18215.34Nova Scotia3307.85902.76375.11New Brunswick2611.17712.02283.45Quebec2100.84450.29314.74Ontario2292.66476.48258.15Manitoba4311.481689.72362.78Saskatchewan5627.552913.78692.9............

(5)

It is also possible to subset the data by indexing the DataFrame by certain rows. The following returns the row of observations corresponding for "Ontario":

data[ Ontario, .. ];

Violent Crime786.62Property Crime2292.66Other Criminal Code476.48Criminal Code Traffic211.57Federal Statute258.15

(6)

Similar to indexing by columns, it is also possible to use the row numbers:

data[ 2, .. ];

Violent Crime824.43Property Crime3294.3Other Criminal Code572.18Criminal Code Traffic348.64Federal Statute215.34

(7)

Filtering Observations

While the index notation for subsetting data frames is powerful for retrieving observations in known rows or columns, it is often more desirable to return rows corresponding to observations that meet a given criteria. For example, say we wanted to known which Canadian province or territory has a "Criminal Code traffic" rate that is greater than 500 per 100,000.

To begin, we will return the "Criminal Code Traffic" column and simply read off the corresponding rows:

data[ `Criminal Code Traffic` ];

Newfoundland and Labrador348.97Prince Edward Island348.64Nova Scotia368.42New Brunswick298.71Quebec511.18Ontario211.57Manitoba276.28......

(8)

Now this approach is fine for smaller data frames, but it is much easier to simply query a DataFrame using an element-wise logical operator to first see which (if any) observations match the criteria:

data[ `Criminal Code Traffic` ] >~ 500;

Newfoundland and LabradorfalsePrince Edward IslandfalseNova ScotiafalseNew BrunswickfalseQuebectrueOntariofalseManitobafalse......

(9)

This returns a truth table, whose entries return a true, false, or fail result depending on if the given observation meets the criteria. In addition, if the DataFrame is indexed by a truth table, a filtered subset is returned: In addition, if the DataFrame is indexed by a truth table, a filtered subset is returned:

data[ data[ `Criminal Code Traffic` ] >~ 500 ];

Violent CrimeProperty CrimeOther Criminal CodeCriminal Code TrafficFederal StatuteQuebec940.522100.84450.29511.18314.74Saskatchewan1963.465627.552913.78886.34692.9Yukon4546.79353.610019.171689.951013.42Northwest Territories6911.4923171.2613834.451535.891331.87Nunavut7934.9513778.878902.56639.61759.87

(10)

The with command is useful for simplifying the syntax for querying DataFrames. with creates named variables corresponding to each of the column labels in a given DataFrame.

with( data );

Violent Crime,Property Crime,Other Criminal Code,Criminal Code Traffic,Federal Statute

(11)

Each column of the DataFrame can be called using its variable name:

`Criminal Code Traffic`;

Newfoundland and Labrador348.97Prince Edward Island348.64Nova Scotia368.42New Brunswick298.71Quebec511.18Ontario211.57Manitoba276.28......

(12)

With bound labels, the following returns the rows where the "Federal Statute" rate is less than or equal to 300 per 100,000:

`Federal Statute` <=~ 300;

Newfoundland and LabradortruePrince Edward IslandtrueNova ScotiafalseNew BrunswicktrueQuebecfalseOntariotrueManitobafalse......

(13)

data[ `Federal Statute` <=~ 300 ];

Violent CrimeProperty CrimeOther Criminal CodeCriminal Code TrafficFederal StatuteNewfoundland and Labrador1276.153317.031010.67348.97267.94Prince Edward Island824.433294.3572.18348.64215.34New Brunswick1164.322611.17712.02298.71283.45Ontario786.622292.66476.48211.57258.15

(14)

It is also possible to filter the DataFrame using multiple queries. When combining queries, the logical operators and and or are used to find either the intersection or union of truth tables, respectively. For example, the following returns the province or territory with "Violent Crime" less than 1000 and "Property Crime" greater than 3000.

`Violent Crime` <~ 1000 and `Property Crime` >~ 3000;

Newfoundland and LabradorfalsePrince Edward IslandtrueNova ScotiafalseNew BrunswickfalseQuebecfalseOntariofalseManitobafalse......

(15)

From the truth table, only Prince Edward Island matches this criteria.

data [ `Violent Crime` <~ 1000 and `Property Crime` >~ 3000 ];

Violent CrimeProperty CrimeOther Criminal CodeCriminal Code TrafficFederal StatutePrince Edward Island824.433294.3572.18348.64215.34

(16)

It can be useful to find the union of queries by using the or logical operator. For example, the following returns observations for which the "Other Criminal Code" rate is greater than 2500 per 100,000 or the observations for which the "Criminal Code Traffic" rate is greater than 500 per 100,000:

data [ `Other Criminal Code` >~ 2500 or `Criminal Code Traffic` >~ 500 ];

Violent CrimeProperty CrimeOther Criminal CodeCriminal Code TrafficFederal StatuteQuebec940.522100.84450.29511.18314.74Saskatchewan1963.465627.552913.78886.34692.9Yukon4546.79353.610019.171689.951013.42Northwest Territories6911.4923171.2613834.451535.891331.87Nunavut7934.9513778.878902.56639.61759.87

(17)

See Also

DataFrame,Guide