A Guide to Data Frames - Maple Programming Help

Online Help

All Products    Maple    MapleSim


Home : Support : Online Help : Statistics and Data Analysis : DataFrames and DataSeries : DataFrame/Guide

A Guide to Data Frames

A DataFrame is one of the basic data structures in Maple. Data frames are a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, however, each variable can have a different type, such as integer, float, string, name, truefalse, etc., which makes data frames an ideal storage device for heterogeneous data.

When printed, Data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. These row and columns are treated as header meta-information and are not a part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index.

The following example page shows a number of common tasks when working with data frames.

 

Creating a new DataFrame

Indexing entries in data frames

Changing values in a DataFrame

Details on DataFrame structure

How to change column names

Adding a new column or row

Dealing with missing values

Removing duplicates

Reordering columns

Reordering rows

Filtering data in a DataFrame

What does with do?

Reshaping data frames

Applying a function to columns of a DataFrame

Converting a DataFrame to other data types

Creating an empty DataFrame

Importing data into a DataFrame

More examples

Creating a new DataFrame

Data frames are rich containers for information that can store multiple types of data. For example, the following vectors contain information on three berries: Raspberries, Grapes and Strawberries respectively.

Variables:

Genus

  

Type: string

genus := < "Rubus", "Vitis", "Fragaria" >:

Nutrition: Energy per 100g in kJ

  

Type: integer

energy := < 220, 288, 136 >:

Nutrition: Carbohydrates per 100g in g

  

Type: float

carbohydrates := < 11.94, 18.1, 7.68 >:

Tons produced worldwide in 2011

  

Type: integer

total_tons := < 543421, 58500118, 4594539 >:

Top producing country in 2011

  

Type: name

top_producer := < Russia, China, USA >:

A DataFrame can combine all of these different types of data into one structure.

DF := DataFrame( < energy | carbohydrates | total_tons | top_producer | genus > );

DF:=12345122011.94543421RussiaRubus228818.158500118ChinaVitis31367.684594539USAFragaria

(1)

It can be observed that in the above DataFrame, the row and column names are equivalent to the index value. The row and column names can be specified using the rows and columns options respectively. It is also useful to specify the types of data in each DataSeries using the datatypes option.

DF := DataFrame( < energy | carbohydrates | total_tons | top_producer | genus >,
            columns = [ Energy, Carbohydrates, `Total Tons`, `Top Producer`, Genus ],
            rows = [ Raspberry, Grape, Strawberry ],
            datatypes = [ integer, float, integer, anything, string ] );

DF:=EnergyCarbohydratesTotal TonsTop ProducerGenusRaspberry22011.9400000000000543421RussiaRubusGrape28818.100000000000058500118ChinaVitisStrawberry1367.680000000000004594539USAFragaria

(2)

Note: It is important to remember that any vectors included in the DataFrame should have the same length. If the vectors are of different length, an error is returned.

Indexing entries in data frames

Unlike matrices, data frames can be indexed using the column or row names. For example, to view the energy in kJ for raspberries:

DF[Raspberry,Energy];

220

(3)

DF[1,1];

220

(4)

Note that above, the index follows the convention [ row, column ] and a single value is retrieved. If a DataFrame is indexed by a single argument, this indexes by [ column ] and the respective column is returned. For example, to retrieve values for all berries for the Energy variable:

DF[Energy];

Raspberry220Grape288Strawberry136

(5)

DF[1];

Raspberry220Grape288Strawberry136

(6)

To return an entire row, for example, the entire row for Raspberries as a DataSeries:

DF[Raspberry,..];

Energy220Carbohydrates11.9400000000000Total Tons543421Top ProducerRussiaGenusRubus

(7)

DF[1,..];

Energy220Carbohydrates11.9400000000000Total Tons543421Top ProducerRussiaGenusRubus

(8)

For more examples, see Indexing a DataFrame or the Subsets of DataFrames example worksheet.

Changing values in a DataFrame

Changing values in an existing DataFrame can be done in a similar way to other data structures in Maple; simply assign to the index position that you want to change. For example, to change the value for Carbohydrates for Grape:

DF[Grape, Carbohydrates] := 20:

DF;

EnergyCarbohydratesTotal TonsTop ProducerGenusRaspberry22011.9400000000000543421RussiaRubusGrape28820.58500118ChinaVitisStrawberry1367.680000000000004594539USAFragaria

(9)

Before moving on, we can set this back to its original value:

DF[2,2] := 18.1:

DF;

EnergyCarbohydratesTotal TonsTop ProducerGenusRaspberry22011.9400000000000543421RussiaRubusGrape28818.100000000000058500118ChinaVitisStrawberry1367.680000000000004594539USAFragaria

(10)

Details on DataFrame structure

There are many useful commands for querying information about the structure of data frames. For example, numelems returned the total number of elements in the DataFrame:

numelems( DF );

15

(11)

The upperbound command returns the dimensions of the DataFrame.

upperbound( DF );

3&comma;5

(12)

To get the number of rows in the DataFrame, put 1 as the second argument to upperbound:

upperbound( DF, 1 );

3

(13)

To get the number of columns in the DataFrame, put 2 as the second argument to upperbound:

upperbound( DF, 2 );

5

(14)

The Datatypes command returns a list of the data type of each of the columns of data:

Datatypes( DF );

integer&comma;float&comma;integer&comma;anything&comma;string

(15)

How to change column names

Data frames can have both row and column names. In the above example, these were added to the DataFrame using the columns and rows options. To retrieve the current list of column names, the ColumnLabels command can be used:

ColumnLabels( DF );

Energy&comma;Carbohydrates&comma;Total Tons&comma;Top Producer&comma;Genus

(16)

The SubsColumnLabel command returns a DataFrame with a modified column label:

DF := SubsColumnLabel( DF, 4, Country );

DF:=EnergyCarbohydratesTotal TonsCountryGenusRaspberry22011.9400000000000543421RussiaRubusGrape28818.100000000000058500118ChinaVitisStrawberry1367.680000000000004594539USAFragaria

(17)

Adding a new column or row

The Append command appends another column or row to an existing DataFrame. The arguments for the Append command specify the DataFrame to append to, the data to append, and optionally, the orientation and the name for the new data (which can be a DataFrame or DataSeries). The following adds a new column to the DataFrame corresponding to truefalse values for if the given berry is a botanical berry or not:

DF := Append( DF, < false, true, false >, label = `Botanical Berry` );

DF:=EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalse

(18)

If we have a new row of information on blackberries such as:

NewRow := DataSeries( < 180 | 9.61 | Mexico | "Rubus" | false >,
            labels = [ Energy, Carbohydrates, Country, Genus, `Botanical Berry` ] );

NewRow:=Energy180Carbohydrates9.61CountryMexicoGenusRubusBotanical Berryfalse

(19)

This row can be added to the existing DataFrame using the Append command.

DF := Append( DF, NewRow, mode = row, label = Blackberry );

DF:=EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalseBlackberry1809.61undefinedMexicoRubusfalse

(20)

Note: The value of the Total tons produced in 2011 is unknown for this DataSeries. When this is appended to the original DataFrame, since there is no value for Total Tons, the value undefined is added where the data is unknown. For more on missing values, see the Dealing with missing values section.

Dealing with missing values

Some datasets may have missing values. In the DataFrame for blackberries, it can be observed that the value for Total Tons is undefined:

DF;

EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalseBlackberry1809.61undefinedMexicoRubusfalse

(21)

There are several commands for dealing with missing values in data frames including DropMissing and FillMissing.

FillMissing can be used to fill missing values with another value. For example, here we can fill the undefined value in the DataFrame with 0:

FillMissing( DF, 0 );

EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalseBlackberry1809.610MexicoRubusfalse

(22)

Note: In order to permanently change the value for any missing values, the result of the FillMissing command must be assigned to the DataFrame.

Removing duplicates

Duplicate rows (or duplicate entries in a given column) do occur in many datasets. Say one was to mistakenly add another row to our DataFrame that contains the same information about blackberries, but the row is added under a misspelled label:

DF := Append( DF, DataFrame( << 180 | 9.61 | Mexico | "Rubus" | false >>,
           columns = [ Energy, Carbohydrates, Country, Genus, `Botanical Berry` ],
           rows = [ Blackbery ] ) );

DF:=EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalseBlackberry1809.61undefinedMexicoRubusfalseBlackbery1809.61undefinedMexicoRubusfalse

(23)

By inspecting the DataFrame above, it is easy to see the duplicate entry, but duplicates may not be easy to find by inspection when dealing with larger DataFrames.

The AreDuplicate command returns a DataSeries of truefalse values that indicate if a row has a duplicate or not.

AreDuplicate( DF );

RaspberryfalseGrapefalseStrawberryfalseBlackberrytrueBlackberytrue

(24)

This result can be used to index a DataFrame to show just the duplicate rows:

DF[ AreDuplicate( DF ) ];

EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryBlackberry1809.61undefinedMexicoRubusfalseBlackbery1809.61undefinedMexicoRubusfalse

(25)

The AreUnique command can be used to show the unique rows in a DataFrame. If a DataFrame is indexed by the results, the duplicate rows are removed. By default, the AreUnique command marks the first instance of a row as unique and any subsequent instance of a matching row as duplicate.

DF[ AreUnique( DF ) ];

EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalseBlackberry1809.61undefinedMexicoRubusfalse

(26)

The keep option controls if the first, last, or none of the duplicates is marked as unique.

DF[ AreUnique( DF, keep = last ) ];

EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalseBlackbery1809.61undefinedMexicoRubusfalse

(27)

DF := DF[ AreUnique( DF, keep = first ) ];

DF:=EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalseBlackberry1809.61undefinedMexicoRubusfalse

(28)

Reordering columns

In the How to change column names section, the ColumnLabels command was used to return the list of variable names. The ColumnLabels command also returns the current order of the columns:

ColumnLabels( DF );

Energy&comma;Carbohydrates&comma;Total Tons&comma;Country&comma;Genus&comma;Botanical Berry

(29)

To reorder columns in a DataFrame, use the DataFrame index notation to specify the new order for the columns and reassign this to the existing DataFrame variable.

DF := DF[ [ Genus, Carbohydrates, Energy, Country, `Total Tons`, `Botanical Berry` ] ];

DF:=GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryRaspberryRubus11.9400000000000220Russia543421falseGrapeVitis18.1000000000000288China58500118trueStrawberryFragaria7.68000000000000136USA4594539falseBlackberryRubus9.61180Mexicoundefinedfalse

(30)

Note: It is not recommended that DataFrames with strictly numeric indexed column labels are reordered. This may lead to a conflict between the column label and the index position.

Reordering rows

In the Reordering columns section, the ColumnLabels command was used to return the current order of the columns. The RowLabels command returns the current order of the rows:

RowLabels( DF );

Raspberry&comma;Grape&comma;Strawberry&comma;Blackberry

(31)

To reorder rows in a DataFrame, use the DataFrame index notation to specify the new order for the rows and reassign this to the existing DataFrame variable.

DF := DF[ [ Blackberry, Grape, Raspberry, Strawberry ], .. ];

DF:=GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryBlackberryRubus9.61180MexicoundefinedfalseGrapeVitis18.1000000000000288China58500118trueRaspberryRubus11.9400000000000220Russia543421falseStrawberryFragaria7.68000000000000136USA4594539false

(32)

Note: It is not recommended that DataFrames with strictly numeric indexed row labels are reordered. This may lead to a conflict between the row label and the index position.

Filtering data in a DataFrame

A very useful aspect of data frames is that they can be queried for subsets of the DataFrame that match a given query. Queries on data frames return a truth table whose entries are either true, false or FAIL depending on if a given element meets a given criteria.

For example, to return all of the berries that have greater than 10 g of carbohydrates per 100g:

DF[ Carbohydrates ] >~ 10;

BlackberryfalseGrapetrueRaspberrytrueStrawberryfalse

(33)

The truth table is a DataSeries that can be used to index the main DataFrame. If the DataFrame is indexed by a DataSeries of type truefalse, it returns a filtered DataFrame:

DF[ DF[ Carbohydrates ] >~ 10 ];

GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryGrapeVitis18.1000000000000288China58500118trueRaspberryRubus11.9400000000000220Russia543421false

(34)

It is possible to use the add command to tally the values of any column of a DataFrame, including the truth table generated using the above query:

add( DF[ Carbohydrates ] >~ 10 );

2false&plus;2true

(35)

This means that for the 4 individuals in the DataFrame, 2 out of 4 have greater then 10 g of carbohydrates per 100 g and 2 out of 4 have less than or equal to 10g of carbohydrates.

 

More advanced queries can be created by combining several queries into one statement using logical operators such as and, or, etc.

For example, which of the berries have less than 10 g of carbohydrates and greater than 150 kJ of energy?

DF[ DF[ Carbohydrates ] <~ 10 and DF[ Energy ] >~ 150 ];

GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryBlackberryRubus9.61180Mexicoundefinedfalse

(36)

For more examples, see the Subsets of DataFrames example page.

What does with do?

The with command creates newly assigned variables that correspond to each of the column names.

with( DF );

Genus&comma;Carbohydrates&comma;Energy&comma;Country&comma;Total Tons&comma;Botanical Berry

(37)

These variables are each a DataSeries that has the same row labels as the parent DataFrame.

Energy;

Blackberry180Grape288Raspberry220Strawberry136

(38)

Binding labels is useful for simplifying the query syntax:

Energy >~ 200;

BlackberryfalseGrapetrueRaspberrytrueStrawberryfalse

(39)

The unwith command unassigns the column names:

unwith( DF );

Energy;

Energy

(40)

Reshaping data frames

The Transpose command returns the transpose of a DataFrame. This turns the variable columns into individual rows and the individual rows into variable columns.

Transpose( DF );

BlackberryGrapeRaspberryStrawberryGenusRubusVitisRubusFragariaCarbohydrates9.6118.100000000000011.94000000000007.68000000000000Energy180288220136CountryMexicoChinaRussiaUSATotal Tonsundefined585001185434214594539Botanical Berryfalsetruefalsefalse

(41)

Applying a function to columns of a DataFrame

Many top-level commands as well as commands in the Statistics package can be applied directly to data frames, however many commands do not work with data frames. For many functions, there simply is not a uniform application that can be sensibly applied to all the DataSeries in a DataFrame. In cases where one wants to apply a function to only a single DataSeries of a DataFrame, the ~ (element-wise) operator can be useful.

For example, say we wanted to round the values in the Carbohydrates column to the nearest integer value, the round command can be applied in the following way:

round~(DF[Carbohydrates]);

Blackberry10Grape18Raspberry12Strawberry8

(42)

To change the values for this column in place, reassign the Carbohydrates column:

DF[Carbohydrates] := round~(DF[Carbohydrates]);

DFCarbohydrates:=Blackberry10Grape18Raspberry12Strawberry8

(43)

It may also be useful to apply a custom function to a DataSeries. For example, the following converts the values in the Energy column to Joules from KJoules.

(x -> x * 1000)~(DF[Energy]);

Blackberry180000Grape288000Raspberry220000Strawberry136000

(44)

The following converts all the country names to strings:

(x -> convert( x, string) )~(DF[Country]);

BlackberryMexicoGrapeChinaRaspberryRussiaStrawberryUSA

(45)

Converting a DataFrame to other data types

It may be beneficial from time to time to convert all columns or some columns of a DataFrame to other data types in order to interact with other commands in the Maple language. The convert command can convert a DataFrame to a Matrix, table, Array or a nested list (by supplying the option nested to a conversion to list).

For example, let us convert the numeric columns of the DataFrame to a Matrix:

convert( DF[ [ Carbohydrates, Energy, `Total Tons` ] ], Matrix );

10180undefined18288585001181222054342181364594539

(46)

The following converts the columns with strings or names data into a nested list:

convert( DF[ [ Genus, Country ] ], list, nested );

Rubus&comma;Vitis&comma;Rubus&comma;Fragaria&comma;Mexico&comma;China&comma;Russia&comma;USA

(47)

Data frames can also be converted to Arrays:

convert( DF[ `Botanical Berry` ], Array );

falsetruefalsefalse

(48)

In most of these cases, the conversion is lossy; the column names and row names are discarded. The only case that attempts to keep the meta-data is the conversion to table, where each table index is the row name for the entry value:

DFtable := convert( DF[ Genus ], table );

DFtable:=tableStrawberry&equals;Fragaria&comma;Grape&equals;Vitis&comma;Raspberry&equals;Rubus&comma;Blackberry&equals;Rubus

(49)

indices( DFtable );

Strawberry&comma;Grape&comma;Raspberry&comma;Blackberry

(50)

entries( DFtable );

Fragaria&comma;Vitis&comma;Rubus&comma;Rubus

(51)

Creating an empty DataFrame

In order to create an empty DataFrame, supply an empty list of lists as the first argument:

NewDF := DataFrame( [[]], columns = [Column1] );

NewDF:=Column1

(52)

Observations can then be added to the DataFrame using the Append command. For more details, see the Adding a new column or row section.

NewDF := Append( NewDF, DataFrame( <<1>>, columns = [Column1] ) );

NewDF:=Column111

(53)

NewDF := Append( NewDF, [2], label = Column2 );

NewDF:=Column1Column2112

(54)

Importing data into a DataFrame

By default, the Import command returns data frames when importing many tabular file formats such as .csv, .tsv, .xls, or .xlsx files.

FilePath := FileTools:-JoinPath( [ kernelopts(datadir), "datasets", "air_passengers.csv" ] ):

Import( FilePath );

Monthly Passengers1949-011121949-021181949-031321949-041291949-051211949-061351949-071481949-08148......

(55)

More examples

There are several examples for working with DataFrames and DataSeries:

• 

Indexing a DataFrame : Examples of selecting, rearranging, and reassigning entries from a DataFrame object

• 

Iris Data : Examples of summarizing data, computing aggregate statistics, and principal component analysis

• 

Subsets of DataFrames : Examples of indexing and filtering columns and rows of a DataFrame

• 

Statistics with DataFrames : Examples of using commands from Statistics on DataFrames


Download Help Document

Was this information helpful?



Please add your Comment (Optional)
E-mail Address (Optional)
What is ? This question helps us to combat spam