Subsets of DataFrames
Subsetting is a core part of data manipulation. A DataFrame can be indexed by column, by row, or by a condition, which makes it possible to extract the subset of data that meets given criteria.
This worksheet gives several examples of subsetting DataFrames.
restart;
interface(rtablesize = 15):
Removing Variables or Observations by Indexing
A common operation when subsetting data frames is simply to remove certain rows or columns.
To begin, we load the canada_crimes data set. It contains five variables and 13 rows of observations: aggregated crime rates per 100,000 people, collected in 2014, for each Canadian province and territory.
data := Import( "datasets/canada_crimes.csv", base=datadir );
                           Violent Crime  Property Crime  Other Criminal Code  Criminal Code Traffic  Federal Statute
Newfoundland and Labrador        1276.15         3317.03              1010.67                 348.97           267.94
Prince Edward Island              824.43          3294.3               572.18                 348.64           215.34
Nova Scotia                      1241.05         3307.85               902.76                 368.42           375.11
New Brunswick                    1164.32         2611.17               712.02                 298.71           283.45
Quebec                            940.52         2100.84               450.29                 511.18           314.74
Ontario                           786.62         2292.66               476.48                 211.57           258.15
Manitoba                         1712.97         4311.48              1689.72                 276.28           362.78
Saskatchewan                     1963.46         5627.55              2913.78                 886.34            692.9
Alberta                          1243.83         4308.67              1497.54                 466.12           371.43
British Columbia                 1148.42          4886.1              1564.03                 350.85           682.27
Yukon                             4546.7          9353.6             10019.17                1689.95          1013.42
Northwest Territories            6911.49        23171.26             13834.45                1535.89          1331.87
Nunavut                          7934.95        13778.87              8902.56                 639.61           759.87
The variables for this DataFrame are:
ColumnLabels( data );
Violent Crime, Property Crime, Other Criminal Code, Criminal Code Traffic, Federal Statute
To return a subset of the DataFrame, index into it. For example, the following returns the "Violent Crime" column as a DataSeries:
data[ `Violent Crime` ];
DataSeries, datatype = anything
Newfoundland and Labrador  1276.15
Prince Edward Island        824.43
Nova Scotia                1241.05
New Brunswick              1164.32
Quebec                      940.52
Ontario                     786.62
Manitoba                   1712.97
Saskatchewan               1963.46
Alberta                    1243.83
British Columbia           1148.42
Yukon                       4546.7
Northwest Territories      6911.49
Nunavut                    7934.95
It is also possible to index the columns of the DataFrame using the integer index values. The following returns the second and third columns:
data[ 2..3 ];
                           Property Crime  Other Criminal Code
Newfoundland and Labrador         3317.03              1010.67
Prince Edward Island               3294.3               572.18
Nova Scotia                       3307.85               902.76
New Brunswick                     2611.17               712.02
Quebec                            2100.84               450.29
Ontario                           2292.66               476.48
Manitoba                          4311.48              1689.72
Saskatchewan                      5627.55              2913.78
Alberta                           4308.67              1497.54
British Columbia                   4886.1              1564.03
Yukon                              9353.6             10019.17
Northwest Territories            23171.26             13834.45
Nunavut                          13778.87              8902.56
The following returns the second, third, and fifth columns:
data[ [ 2, 3, 5 ] ];
                           Property Crime  Other Criminal Code  Federal Statute
Newfoundland and Labrador         3317.03              1010.67           267.94
Prince Edward Island               3294.3               572.18           215.34
Nova Scotia                       3307.85               902.76           375.11
New Brunswick                     2611.17               712.02           283.45
Quebec                            2100.84               450.29           314.74
Ontario                           2292.66               476.48           258.15
Manitoba                          4311.48              1689.72           362.78
Saskatchewan                      5627.55              2913.78            692.9
Alberta                           4308.67              1497.54           371.43
British Columbia                   4886.1              1564.03           682.27
Yukon                              9353.6             10019.17          1013.42
Northwest Territories            23171.26             13834.45          1331.87
Nunavut                          13778.87              8902.56           759.87
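Because indexing returns only the columns that are listed, a column can be removed by listing every column except the one to drop. The following is a minimal sketch on a hand-typed two-row excerpt of the data above; the name excerpt is hypothetical and is not part of the worksheet.

```maple
# Hand-typed excerpt of the crime data; `excerpt` is an illustrative name
excerpt := DataFrame( Matrix( [ [ 940.52, 2100.84, 450.29 ],
                                [ 786.62, 2292.66, 476.48 ] ] ),
                      rows = [ Quebec, Ontario ],
                      columns = [ `Violent Crime`, `Property Crime`, `Other Criminal Code` ] );

# Keep columns 1 and 3, which drops "Property Crime"
excerpt[ [ 1, 3 ] ];
```

The same pattern applies to the full data set: data[ [ 1, 2, 4, 5 ] ] keeps every column except "Other Criminal Code".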
It is also possible to subset the data by indexing the DataFrame by rows. The following returns the row of observations corresponding to "Ontario":
data[ Ontario, .. ];
DataSeries, datatype = anything
Violent Crime           786.62
Property Crime         2292.66
Other Criminal Code     476.48
Criminal Code Traffic   211.57
Federal Statute         258.15
Similar to indexing by columns, it is also possible to use the row numbers:
data[ 2, .. ];
DataSeries, datatype = anything
Violent Crime           824.43
Property Crime          3294.3
Other Criminal Code     572.18
Criminal Code Traffic   348.64
Federal Statute         215.34
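Row and column indices can presumably be combined in one index, much as for a Matrix. The sketch below assumes that a row and column pair returns a single entry and that a range in the row position returns a block of rows; neither form is shown in this worksheet, so both should be checked against the DataFrame help pages. The name excerpt is hypothetical.

```maple
# Hand-typed excerpt of the crime data; `excerpt` is an illustrative name
excerpt := DataFrame( Matrix( [ [ 940.52, 2100.84 ],
                                [ 786.62, 2292.66 ],
                                [ 1712.97, 4311.48 ] ] ),
                      rows = [ Quebec, Ontario, Manitoba ],
                      columns = [ `Violent Crime`, `Property Crime` ] );

# Assumed form: one entry, the "Violent Crime" rate for Ontario
excerpt[ Ontario, `Violent Crime` ];

# Assumed form: the first two rows only, which leaves out Manitoba
excerpt[ 1..2, .. ];
```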
Filtering Observations
While index notation is powerful for retrieving observations in known rows or columns, it is often more useful to return the rows whose observations meet given criteria. For example, suppose we want to know which Canadian provinces or territories have a "Criminal Code Traffic" rate greater than 500 per 100,000.
To begin, we will return the "Criminal Code Traffic" column and simply read off the corresponding rows:
data[ `Criminal Code Traffic` ];
DataSeries, datatype = anything
Newfoundland and Labrador   348.97
Prince Edward Island        348.64
Nova Scotia                 368.42
New Brunswick               298.71
Quebec                      511.18
Ontario                     211.57
Manitoba                    276.28
Saskatchewan                886.34
Alberta                     466.12
British Columbia            350.85
Yukon                      1689.95
Northwest Territories      1535.89
Nunavut                     639.61
This approach is fine for small DataFrames, but it is much easier to query the DataFrame with an element-wise comparison, which first shows which (if any) observations match the criteria:
data[ `Criminal Code Traffic` ] >~ 500;
DataSeries, datatype = truefalseFAIL
Newfoundland and Labrador  false
Prince Edward Island       false
Nova Scotia                false
New Brunswick              false
Quebec                     true
Ontario                    false
Manitoba                   false
Saskatchewan               true
Alberta                    false
British Columbia           false
Yukon                      true
Northwest Territories      true
Nunavut                    true
This returns a truth table whose entries are true, false, or FAIL, depending on whether the corresponding observation meets the criteria. Indexing the DataFrame by a truth table returns the filtered subset:
data[ data[ `Criminal Code Traffic` ] >~ 500 ];
                           Violent Crime  Property Crime  Other Criminal Code  Criminal Code Traffic  Federal Statute
Quebec                            940.52         2100.84               450.29                 511.18           314.74
Saskatchewan                     1963.46         5627.55              2913.78                 886.34            692.9
Yukon                             4546.7          9353.6             10019.17                1689.95          1013.42
Northwest Territories            6911.49        23171.26             13834.45                1535.89          1331.87
Nunavut                          7934.95        13778.87              8902.56                 639.61           759.87
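The truth table does not have to be written inline. A minimal sketch, assuming a hand-typed three-row excerpt of the data: the truth table is stored under the hypothetical name mask, reused as a filter, and the filtered result is indexed once more to keep a single column. The names excerpt and mask are illustrative only.

```maple
# Hand-typed excerpt of the crime data; `excerpt` and `mask` are illustrative names
excerpt := DataFrame( Matrix( [ [ 940.52, 511.18 ],
                                [ 786.62, 211.57 ],
                                [ 1963.46, 886.34 ] ] ),
                      rows = [ Quebec, Ontario, Saskatchewan ],
                      columns = [ `Violent Crime`, `Criminal Code Traffic` ] );

# Store the truth table once
mask := excerpt[ `Criminal Code Traffic` ] >~ 500;

# Reuse it to filter the rows
excerpt[ mask ];

# Index the filtered DataFrame again to keep one column
excerpt[ mask ][ `Violent Crime` ];
```

In this excerpt, Quebec and Saskatchewan have a "Criminal Code Traffic" rate above 500 and Ontario does not. Naming the truth table keeps long queries readable and avoids recomputing the comparison.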
The with command is useful for simplifying the syntax for querying DataFrames. with creates named variables corresponding to each of the column labels in a given DataFrame.
with( data );
Each column of the DataFrame can be called using its variable name:
`Criminal Code Traffic`;
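With the column labels bound, queries become shorter. The following tests which rows have a "Federal Statute" rate less than or equal to 300 per 100,000; indexing the DataFrame by the resulting truth table then returns those rows: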
`Federal Statute` <=~ 300;
DataSeries, datatype = truefalseFAIL
Newfoundland and Labrador  true
Prince Edward Island       true
Nova Scotia                false
New Brunswick              true
Quebec                     false
Ontario                    true
Manitoba                   false
Saskatchewan               false
Alberta                    false
British Columbia           false
Yukon                      false
Northwest Territories      false
Nunavut                    false
data[ `Federal Statute` <=~ 300 ];
                           Violent Crime  Property Crime  Other Criminal Code  Criminal Code Traffic  Federal Statute
Newfoundland and Labrador        1276.15         3317.03              1010.67                 348.97           267.94
Prince Edward Island              824.43          3294.3               572.18                 348.64           215.34
New Brunswick                    1164.32         2611.17               712.02                 298.71           283.45
Ontario                           786.62         2292.66               476.48                 211.57           258.15
It is also possible to filter the DataFrame using multiple queries. When combining queries, the logical operators and and or are used to find either the intersection or union of truth tables, respectively. For example, the following returns the province or territory with "Violent Crime" less than 1000 and "Property Crime" greater than 3000.
`Violent Crime` <~ 1000 and `Property Crime` >~ 3000;
DataSeries, datatype = truefalseFAIL
Newfoundland and Labrador  false
Prince Edward Island       true
Nova Scotia                false
New Brunswick              false
Quebec                     false
Ontario                    false
Manitoba                   false
Saskatchewan               false
Alberta                    false
British Columbia           false
Yukon                      false
Northwest Territories      false
Nunavut                    false
From the truth table, only Prince Edward Island matches these criteria.
data[ `Violent Crime` <~ 1000 and `Property Crime` >~ 3000 ];
                      Violent Crime  Property Crime  Other Criminal Code  Criminal Code Traffic  Federal Statute
Prince Edward Island         824.43          3294.3               572.18                 348.64           215.34
It can be useful to find the union of queries by using the or logical operator. For example, the following returns observations for which the "Other Criminal Code" rate is greater than 2500 per 100,000 or the observations for which the "Criminal Code Traffic" rate is greater than 500 per 100,000:
data[ `Other Criminal Code` >~ 2500 or `Criminal Code Traffic` >~ 500 ];
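The two operators can also be mixed, with parentheses making the grouping explicit. The sketch below assumes that a parenthesized combination of truth tables behaves like the two-term queries above; it uses a hand-typed three-row excerpt of the data under the hypothetical name excerpt.

```maple
# Hand-typed excerpt of the crime data; `excerpt` is an illustrative name
excerpt := DataFrame( Matrix( [ [ 940.52, 450.29, 511.18 ],
                                [ 786.62, 476.48, 211.57 ],
                                [ 1963.46, 2913.78, 886.34 ] ] ),
                      rows = [ Quebec, Ontario, Saskatchewan ],
                      columns = [ `Violent Crime`, `Other Criminal Code`, `Criminal Code Traffic` ] );

# Bind the column labels of the excerpt as names
with( excerpt );

# Either rate is high, and the violent crime rate is below 1000
excerpt[ ( `Other Criminal Code` >~ 2500 or `Criminal Code Traffic` >~ 500 )
         and `Violent Crime` <~ 1000 ];
```

In this excerpt only Quebec satisfies the combined condition: Ontario fails both rate tests, and Saskatchewan passes them but has a "Violent Crime" rate above 1000. Note that with( excerpt ) rebinds the column names, so they no longer refer to the columns of data afterwards.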
See Also
DataFrame,Guide