updates/Maple17/Performance - Help

Online Help

All Products    Maple    MapleSim


Home : Support : Online Help : updates/Maple17/Performance

Performance

 

Improvements in performance and efficiency:

• 

Elementary Functions

• 

Multiple Memory Regions

• 

Parallel Linear Algebra

• 

Parallel Garbage Collector

• 

Polynomial Arithmetic 

• 

Sparse Matrices and Vectors

With every new release, Maplesoft strives to improve the efficiency and speed of its mathematical computations.  This involves making improvements in the most frequently called routines and algorithms, as well as in the low-level infrastructure.

 

Maple 17 features new algorithms for faster numerical calculations with complex numbers and new low-level routines that make sparse matrix and vector concatenation much more practical.  Updates to optimized BLAS and LAPACK functions for floating-point matrices and vectors make matrix computations such as Cholesky decomposition and eigenvalues, much quicker on multi-core systems.

 

Memory usage in Maple is much improved with the combination of multiple memory regions, which reduce fragmentation and make much more efficient use of overall memory, and the new parallel garbage collector, which helps to efficiently release memory that is no longer used.

 

Lastly, numerous improvements in underlying routines combined with a new high performance data structure for distributed multivariate polynomials drastically improves the speed and scalability of most polynomial computations.

Elementary Functions

Hardware algorithms have been introduced into Maple 17 that greatly speed up operations on complex floating-point numbers. Compared to Maple 16, the Maple 17 is almost 2000 times faster in some cases.  

 

To demonstrate this, the following commands create a random vector, V, containing 10,000 complex floating-point numbers. The time (in seconds) needed to perform operations on all 10,000 elements is found using various functions.

Example

Benchmarks computed on an Intel Q8200 2.33GHz Quad-Core CPU running Linux.

V:=LinearAlgebra:-RandomVector104,datatype=complex8:

timerealln~V;

0.002

(1.1)

timereallog~V;

0.002

(1.2)

timereallog10~V;

0.005

(1.3)

timerealsqrt~V;

0.001

(1.4)

timerealarcsin~V;

0.003

(1.5)

timerealarcsinh~V;

0.027

(1.6)

timerealarctan~V;

0.003

(1.7)

timerealarctanh~V;

0.003

(1.8)

timerealarccos~V;

0.003

(1.9)

timerealarccosh~V;

0.003

(1.10)

 

The following lists, m16 and m17, contain the measured times to compute these operations using Maple 16 and 17, respectively. The column graph plots these lists to show the speed up between the two versions.

m16:=2.557,2.914,3.162,2.020,5.193,5.417,2.810,2.972,5.655,6.300:

m17:=0.002,0.002,0.005,0.001,0.003,0.0027,0.003,0.003,0.003,0.003:

Statistics:-ColumnGraphm16,m17,datasetlabels=ln,log,log10,sqrt,arcsin,arcsinh,arctan,arctanh,arccos,arccosh,legend=Maple 16,Maple 17,labels=,time (s),title=Computation Times,titlefont=TIMES, BOLD, 18;

 

 

Multiple Memory Regions

A significant improvement to Maple's memory allocator is the addition of multiple memory regions.

 

Benefits of multiple memory regions include:

• 

More efficient allocation of memory, leading to more memory being available to solve bigger problems without the need to initially specify the size of the reserved region.

• 

Better resource sharing between different processes running on the same machine, including multiple instances of Maple and multiple Maple kernels launched as parallel grid nodes.

• 

Performance improvements due to less fragmentation and cache locality, as well as performance improvements when running parallel code.

 

Previously, Maple initially reserved a large contiguous heap of memory that was subsequently managed by additional allocations and garbage collection as required. Initially, only a small fraction of this reserved memory was allotted to handle memory allocation requests. Whenever a given allotment was exhausted, a garbage collection would occur to reclaim any unused memory. As the computation required more memory, the memory management system would progressively commit more of the heap to be allocated.

 

With the addition of memory regions, Maple is no longer constrained by the limitation imposed by a single large contiguous memory region (heap).  Instead, Maple is able to incorporate additional regions upon demand. Various types of memory regions are provided: small allocations come from thread-local regions (512 words and less), medium-sized allocations reside in global thread-shared regions (less than 1MB), and finally large blocks of memory are individually allocated as distinct regions of their own. Overall, this allows Maple to more effectively and efficiently manage the available memory resources, and more specifically, this offers improvements to caching via better data locality and reduced fragmentation.

 

In earlier releases of Maple, when a chunk of memory was committed, it remained under Maple's control for the remainder of the session. Handling memory requests larger than 1MB by allocating an individually tailored memory region enables Maple 17 to return this large block of memory back to the operating system when it is no longer needed.

 

 

Example

For example, the following code snippet allocates an array of matrices, does some work, and then clobbers the array thereby rendering the allocated memory garbage.  Notice that memory usage drops after the explicit call to garbage collection.*

 

restart;

size  1024:elems  10:A  Array 1.. elems, 1..elems :for i to elems do    for j to elems do       Ai,  j  Matrix size, size, datatype = float8 :   end do: end do:


Report how many KB Maple is using.

kerneloptsbytesalloc1024;

850704

(2.1)

Clobber the top level array (for brevity we have omitted computing upon the matrices).

A  0:


How much memory is Maple using now?

gc;kerneloptsbytesalloc1024;

88476

(2.2)

* The explicit call to gc() is used here to illustrate a specific behavior of Maple's memory management system. In general, it is better to allow Maple to decide when to initiate a garbage collection cycle.

Parallel Linear Algebra

The access to optimized BLAS and LAPACK functions for floating-point Matrices and Vectors has been updated in Maple 17.

 

On Linux and OS X, this involves updates to the ATLAS (Automatically Tuned Linear Algebra Software) dynamic libraries which are shipped with Maple.  While on 64-bit Windows, Maple 17 has been updated to use Intel's Math Kernel Library (MKL).  

 

Performance enhancements for floating-point Linear Algebra operations include improved use of multiple cores and CPUs.

 

The operations referenced in the plots in this document were all performed on real, floating-point Matrices and Vectors created with the datatype=float[8] option.

Linux (32bit)

Linux 32bit , on Intel Core2 Quad CPU Q8200 @ 2.33GHz

Linux (64bit)

Linux 64bit on Intel Core i7 CPU 920 @ 2.67GHz

OS X

OS X 10.6.6 (64bit), on Intel Xeon (x2) Quad-Core @ 2.63GHz

Windows (64bit)

Windows Server 2003 (64bit), on Intel Xeon (E5335) Quad-Core @ 2.00GHz

 

Parallel Garbage Collector

One of the core components of Maple's engine is the garbage collector. The garbage collector is responsible for finding and reclaiming memory that is no longer needed by the evaluation engine. In Maple 17, the garbage collector is capable of taking advantage of multiple processors to perform its job more quickly. As the garbage collector is used continuously as Maple runs, this speed up helps all users, not just those running parallel algorithms.

 

• 

The garbage collector in Maple 17 is capable of running multiple threads to take advantage of multiple cores.  This can speed up the collector significantly.  Although the collector is only responsible for a fraction of Maple's total running time, this can still lead to running times reduced by up to 10%.  These speed ups require no changes to user code.

• 

There are 2 kernelopts that can control the parallel collector.

– 

gcmaxthreads controls the maximum number of threads that the parallel garbage collector will use.

– 

gcthreadmemorysize controls how many threads Maple will use for a collection.  The number of bytes allocated is divided by the value assigned to gcthreadmemorysize to determine the number of threads.

Example

The following example shows the speed up in the garbage collector from using multiple threads.

restart

if kerneloptswordsize = 32 then    upperBound  210^7;else   upperBound  10^8; end if:

data  seqi, i = 1 .. upperBound:

p:=proc        local i;        for i to 100         do               gc        end do end proc:

kerneloptsgcmaxthreads=1: t11timerealp:

kerneloptsgcmaxthreads=2:t12timerealp:

kerneloptsgcmaxthreads=4:t14timerealp:

Statistics:-ColumnGrapht1[1],t1[2],t1[4],datasetlabels=One Thread,Two Threads,Four Threads,labels=,Seconds,gridlines

 

The garbage collection algorithm only comprises a small amount of the total Maple running time, so the speed up will not be this significant for real Maple computations.  The following example shows a more realistic situation.

restart

d:=seqrandpolyx,y,z,dense,degree=12,i=1..10000:

f:=proc       local i;       mapiinti,x,d end proc:

kerneloptsgcmaxthreads=1:

t21timerealf:

 

kerneloptsgcmaxthreads=2:

t22timerealf:

 

kerneloptsgcmaxthreads=4:

t24timerealf:

 

Statistics:-ColumnGrapht21,t22,t24,datasetlabels=One Thread,Two Threads,Four Threads,labels=,Seconds,gridlines

 

Polynomials with Integer Coefficients

Maple 17 automatically uses a new high performance data structure for distributed multivariate polynomials with integer coefficients.  Together with new algorithms in the Maple kernel, this drastically improves the speed and scalability of most polynomial computations.  

Benchmarks computed in real times on a 64-bit quad core Intel Core i7 920 2.66 GHz.

Overview

Expanding this polynomial with 170544 terms takes 31 milliseconds and 2.6 MB of memory in Maple 17, versus 250 ms and 29 MB in Maple 16. This speedup is entirely due to the reduction in overhead for the new data structure. The external C routine that computes this result has not changed.

 

f:=1+x+y+z+t+u+v+w15:

fCodeTools:-Usageexpandf:

 

Maple 17:

memory used=2.67MiB, alloc change=2.61MiB, cpu time=30.00ms, real time=31.00ms

 

Maple 16:

memory used=23.40MiB, alloc change=29.16MiB, cpu time=242.00ms, real time=250.00ms

 

Mathematica® 9:

0.564 seconds

AbsoluteTiming[ f = Expand[ (1+x+y+z+t+u+v+w)^15 ]; ]

 

 

 

Common queries such as finding the variables, computing the total degree, and testing for a polynomial run in O(1) time for the new data structure.  The effect on Maple library is profound.

 

X:=CodeTools:-Usageindetsf:

dCodeTools:-Usagedegreef:

CodeTools:-Usagetypef,polynom 

Maple 17:

memory used=0.59KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns

memory used=0.59KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns

memory used=432 bytes, alloc change=0 bytes, cpu time=0ns, real time=0ns

 

Maple 16:

memory used=0.66KiB, alloc change=0 bytes, cpu time=76.00ms, real time=77.00ms

memory used=0.72KiB, alloc change=0 bytes, cpu time=119.00ms, real time=119.00ms

memory used=0.55KiB, alloc change=0 bytes, cpu time=53.00ms, real time=54.00ms

 

Mathematica® 9:

0.125 seconds, X = Variables[ f ]

0.166 seconds, d = Exponent[ f, x]  (* degree in x, total degree not builtin *)

4.611 seconds, PolynomialQ[ f ]

 

 

New high performance algorithms were developed for the Maple kernel to exploit the new data structure.  We improved many core operations including differentiation, extracting a coefficient, and evaluating a polynomial.

 

CodeTools:-Usagexf:

CodeTools:-Usagecoeffsf,x:

CodeTools:-Usagefx=2,y=3,z=5|fx=2,y=3,z=5:

 

Maple 17:

memory used=2.60MiB, alloc change=2.61MiB, cpu time=5.00ms, real time=5.00ms

memory used=5.21MiB, alloc change=2.61MiB, cpu time=21.00ms, real time=21.00ms

memory used=2.60MiB, alloc change=2.61MiB, cpu time=26.00ms, real time=26.00ms

 

Maple 16:

memory used=12.36MiB, alloc change=9.72MiB, cpu time=174.00ms, real time=175.00ms

memory used=34.78MiB, alloc change=29.16MiB, cpu time=194.00ms, real time=194.00ms

memory used=48.63MiB, alloc change=34.02MiB, cpu time=476.00ms, real time=478.00ms

 

Mathematica® 9:

1.001 seconds, D[f,x];

1.030 seconds, CoefficientList[f, x];

1.304 seconds, f /. {x->2, y->3, z->5};

 

 

Dense methods have been added to multiply multivariate polynomials in subquadratic time.  Maple 17 automatically selects a sparse or dense algorithm, balancing time and memory use.  Dense algorithms are not parallelized in Maple 17.

 

g:=expandx5+x3+x2+x+120y5+y3+y2+y+120:

pCodeTools:-Usageexpandg+1g+2:

 

Maple 17:

memory used=2.55MiB, alloc change=0.61MiB, cpu time=148.00ms, real time=149.00ms

 

Maple 16:

memory used=4.07MiB, alloc change=0 bytes, cpu time=4.55s, real time=1.19s

 

Mathematica® 9:

1.331 seconds

g = Expand[ (x^5+x^3+x^2+x+1)^20 (y^5+y^3+y^2+y+1)^20 ];

AbsoluteTiming[ p = Expand[ (g+1)*(g+2) ]; ]

 

 

Parallel speedup increases in Maple 17 due to the elimination of sequential overhead and Amdahl's Law.  For the factorization below, Maple 17 uses 26.01 CPU seconds versus 62.24 for Maple 16, so it saves a little more than half the work.  However, all of the time saved was sequential, so Maple 17 actually runs 3.7x faster on the quad core CPU, and parallel speedup increases from 1.3x to 2.0x.

 

f:=expand1+x+y+z+t20:

p:=expandf+1f+2:

CodeTools:-Usagefactorp:

 

Maple 17:

memory used=1.43GiB, alloc change=82.04MiB, cpu time=26.01s, real time=12.90s

 

Maple 16:

memory used=4.29GiB, alloc change=256.41MiB, cpu time=62.24s, real time=47.60s

 

Mathematica® 9:

952.857 seconds

f = Expand[ (1+x+y+z+t)^20 ];

p = Expand[ (f+1)*(f+2) ];

AbsoluteTiming[ Factor[ p ]; ]

 

 

 

Benchmarks

For each problem we expand the input f and g, then we time the multiplication p = fg, the division q=pf, and the factorization of p.

For Maple we timed p := expand(f*g); divide(p,f,'q'); and factor(p); 

For Mathematica® we timed p = Expand[f g]; PolynomialQuotient[p, f, x]; and Factor[p]; except for Problem #5 where Cancel[p/f]; was much faster than PolynomialQuotient.

 

All times are real times in seconds on a 64-bit Core i7 920 2.66GHz Linux system.

Problem #1:  1771 x 1771 = 12341 terms

f = x+y+z+120+1

g = f+1

Problem #2:  1771 x 1771 = 12341 terms

f = x2+y2+z2+120+1

g = f+1

Problem #3:  5456 x 5456 = 39711 terms

f = x+y+z+130+1

g = f+1

Problem #4:  10626 x 10626 = 135751 terms

f = x+y+z+t+120+1

g = f+1

Problem #5:  9261 x 9261 = 12552 terms

f = 1+x201+y201+z20+1

g = 1x201y201z20+1

Problem #6:  3003 x 3003 = 417311 terms

f = 1+u2+v+w2+xy10+1

g = 1+y+v2+w+x2+y10+1

 

 

The timing data is given in matrix form below.  Mathematica® 9 is the first row, Maple 16 is the middle row, and Maple 17 is the last row.

expand=0.1260.1290.8355.6761.4531.3860.0300.0300.1680.6400.3830.6020.0120.0130.1010.4140.1550.082, divide=0.6620.6510.8359.71149.2956.3530.0330.0330.1850.7390.4020.6270.0140.0150.1160.4780.3250.095, factor=18.453114.100274.725971.881951.782429.7012.6803.08014.16046.60015.98049.0800.6750.8394.42012.8406.7105.900

(5.2.1)

 

Sparse Matrices and Vectors

Enhancements have been made in Maple 17 for the performance of stacking and augmenting sparse floating-point, hardware datatype Matrices and Vectors. Computing linear combinations of pairs of sparse floating-point, hardware datatype Matrices using the LinearAlgebra:-MatrixAdd command has also been improved.

 

In the examples below, the Matrix and Vector constructors are used for stacking or augmentation (concatenation).

 

Benchmarks computed in 64-bit Maple 17 for Linux on an Intel Core i5 CPU 760 @ 2.80GHz.

Sparse vector concatenation

In Maple 16, the concatenation of Vector V with itself three times would take approximately 28 seconds and would result in an output Vector with full rectangular storage. The allocated memory would increase by 1.27GiB.

 

In Maple 17, it takes less than 0.2 seconds and produces a Vector with sparse storage. Memory allocation increases by 30.5MiB.

Example

restart:

withLinearAlgebra:

V:=RandomVector107,generator=0.0..1.0,density=0.05,storage=sparse,datatype=float8:

rtable_num_elemsV,NonZeroStored

500366

(6.1.1)

biggerV:=CodeTools:-UsageVectorV,V,V,V

memory used=76.37MiB, alloc change=30.54MiB, cpu time=100.00ms, real time=121.00ms

biggerV:= 1 .. 40000000 VectorcolumnData Type: float8Storage: sparseOrder: Fortran_order

(6.1.2)

rtable_num_elemsbiggerV,NonZeroStored

2001464

(6.1.3)

Sparse matrix concatenation

In Maple 16, the concatenation of Matrix M with itself as performed below would take approximately 20 seconds and would result in an output Matrix with full rectangular storage.

 

In Maple 17, it takes less than 0.1 seconds and produces a Matrix with sparse storage. Memory allocation increases by 48MiB.

Example

restart:

withLinearAlgebra:

M:=RandomMatrix103,103,generator=0.0..1.0,density=0.01,storage=sparse,datatype=float8:

rtable_num_elemsM,NonZeroStored

10053

(6.2.1)

biggerM:=CodeTools:-UsageMatrixM,M

memory used=0.76MiB, alloc change=48.00MiB, cpu time=30.00ms, real time=55.00ms

biggerM:= 1000 x 2000 MatrixData Type: float8Storage: sparseOrder: Fortran_order

(6.2.2)

rtable_num_elemsbiggerM,NonZeroStored

20106

(6.2.3)

Simple linear combinations of sparse matrices

In Maple 16, the following command would take 3 minutes and allocated memory would increase by 400MiB.

 

In Maple 17, it takes less than 1 second and allocated memory increases by 76MiB.

Example

restart:

withLinearAlgebra:

M:=RandomMatrix104,104,generator=0.0..1.0,density=0.01,storage=sparse,datatype=float8:

CodeTools:-UsageMatrixAddM,M,2,3.3

memory used=76.67MiB, alloc change=76.35MiB, cpu time=360.00ms, real time=417.00ms

10000 x 10000 MatrixData Type: float8Storage: sparseOrder: Fortran_order

(6.3.1)

 

See Also

Elementary Functions in Maple 17, Memory Regions in Maple 17, Parallel Linear Algebra, Garbage Collection, Sparse Matrices and Vectors


Download Help Document

Was this information helpful?



Please add your Comment (Optional)
E-mail Address (Optional)
What is ? This question helps us to combat spam