# CAT L1.4 BUSINESS MATHEMATICS Study Manual


- BM CAT Study Manual Cover Contents and Syllabus RO'Neill 2 08 12
- BM CAT Study Manual Unit 1-3 RO'Neill 2 08 12
- BM CAT Study Manual Unit 4 RO'Neill 2 08 12
- BM CAT Study Manual Unit 5-6 RO'Neill 2 08 12
- BM CAT Study Manual Unit 7-8 RO'Neill 2 08 12
- BM CAT Study Manual Unit 9 RO'Neill 2 08 12

- BM CAT Study Manual Unit 10 RO'Neill 2 08 12
- BM CAT Study Manual Unit 11 RO'Neill 2 08 12
- BM CAT Study Manual Unit 12-13 RO'Neill 2 08 12
- BM CAT Study Manual Unit 14 RO'Neill 2 08 12
- BM CAT Study Manual Unit 15 RO'Neill 2 08 12
- BM CAT Study Manual Unit 16 RO'Neill 2 08 12


Page 1

© iCPAR

All rights reserved.

The text of this publication, or any part thereof, may not be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, storage in an information retrieval system, or otherwise, without prior permission of the publisher.

Whilst every effort has been made to ensure that the contents of this book are accurate, no responsibility for loss occasioned to any person acting or refraining from action as a result of any material in this publication can be accepted by the publisher or authors. In addition to this, the authors and publishers accept no legal responsibility or liability for any errors or omissions in relation to the contents of this book.

INSTITUTE OF

CERTIFIED PUBLIC ACCOUNTANTS

OF

RWANDA

Level 1

L1.4 BUSINESS MATHEMATICS

First Edition 2012

This study manual has been fully revised and updated in accordance with the current syllabus. It has been developed in consultation with experienced lecturers.


Page 3

CONTENTS

Study Unit    Title    Page

Introduction to the Course

1: PROBABILITY    11
   Estimating Probabilities    13
   Types of Event    17
   The Two Laws of Probability    19
   Tree Diagrams    29
   Binomial Distribution    39
   Poisson Distribution    41
   Venn Diagrams    43

2: COLLECTION OF DATA    45
   Collection of Data    47
   Types of Data    49
   Requirements of Statistical Data    51
   Methods of Collecting Data    53
   Interviewing    57
   Designing the Questionnaire    59
   Choice of Method    65
   Pareto Distribution and the “80:20” Rule    67

3: TABULATION & GROUPING OF DATA    69
   Introduction to Classification & Tabulation of Data    71
   Forms of Tabulation    75
   Secondary Statistical Tabulation    79
   Rules for Tabulation    81
   Sources of Data & Presentation Methods    85

4: GRAPHICAL REPRESENTATION OF INFORMATION    93
   Introduction to Frequency Distributions    95
   Preparation of Frequency Distributions    97
   Cumulative Frequency Distributions    103
   Relative Frequency Distributions    105
   Graphical Representation of Frequency Distributions    107
   Introduction to Other Types of Data Presentation    117
   Pictograms    119
   Pie Charts    123
   Bar Charts    125
   General Rules for Graphical Presentation    129
   The Lorenz Curve    131

Page 4

5: AVERAGES OR MEASURES OF LOCATION    137
   The Need for Measures of Location    139
   The Arithmetic Mean    141
   The Mode    153
   The Median    159

6: MEASURES OF DISPERSION    165
   Introduction to Dispersion    167
   The Range    169
   The Quartile Deviation, Deciles and Percentiles    171
   The Standard Deviation    177
   The Coefficient of Variation    183
   Skewness    185
   Averages & Measures of Dispersion    189

7: THE NORMAL DISTRIBUTION    203
   Introduction    205
   The Normal Distribution    207
   Calculations Using Tables of the Normal Distribution    209

8: INDEX NUMBERS    215
   The Basic Idea    217
   Building Up an Index Number    219
   Weighted Index Numbers    223
   Formulae    229
   Quantity or Volume Index Numbers    231
   The Chain-Base Method    237
   Deflation of Time Series    239

9: PERCENTAGES & RATIOS, SIMPLE & COMPOUND INTEREST, DISCOUNTED CASH FLOW    245
   Percentages    247
   Ratios    249
   Simple Interest    253
   Compound Interest    257
   Introduction to Discounted Cash Flow Problems    263
   Two Basic DCF Methods    273
   Introduction to Financial Mathematics    283
   Manipulation of Inequalities    325

10: CORRELATION    327
   General    329
   Scatter Diagrams    331
   The Correlation Coefficient    337
   Rank Correlation    343

11: LINEAR REGRESSION    351
   Introduction    353
   Regression Lines    355
   Use of Regression    361
   Connection between Correlation and Regression    363

Page 5

12: TIME SERIES ANALYSIS I    365
   Introduction    367
   Structure of a Time Series    369
   Calculation of Component Factors for the Additive Model    375

13: TIME SERIES ANALYSIS II    387
   Forecasting    389
   The Z Chart    393
   Summary    395

14: LINEAR PROGRAMMING    397
   The Graphical Method    399
   The Graphical Method Using Simultaneous Equations    417
   Sensitivity Analysis (graphical)    423
   The Principles of the Simplex Method    433
   Sensitivity Analysis (simplex)    447
   Using Computer Packages    455
   Using Linear Programming    459

15: RISK AND UNCERTAINTY    463
   Risk & Uncertainty    465
   Allowing for Uncertainty    467
   Probabilities and Expected Value    471
   Decision Rules    475
   Decision Trees    481
   The Value of Information    491
   Sensitivity Analysis    503
   Simulation Models    505

16: SPREADSHEETS    509
   Origins of Spreadsheets    511
   Modern Spreadsheets    513
   Concepts    515
   How Spreadsheets Work    521
   Users of Spreadsheets    523
   Advantages & Disadvantages of Spreadsheets    525
   Spreadsheets in Today’s Climate    527


Page 7

Stage: Level 1

Subject Title: L1.4 Business Mathematics

Aim

The aim of this subject is to provide students with the tools and techniques to understand the mathematics associated with managing business operations. Probability and risk play an important role in developing business strategy. Preparing forecasts and establishing the relationships between variables are an integral part of budgeting and planning. Financial mathematics provides an introduction to interest rates and annuities and to investment appraisals for projects. Preparing graphs and tables in summarised formats and using spreadsheets are important in both the calculation of data and the presentation of information to users.

Learning Objectives:

On successful completion of this subject students should be able to:

• Demonstrate the use of basic mathematics and solve equations and inequalities
• Calculate probability and demonstrate the use of probability where risk and uncertainty exists
• Apply techniques for summarising and analysing data
• Calculate the correlation coefficient for bivariate data and apply techniques for simple regression
• Demonstrate forecasting techniques and prepare forecasts
• Calculate present and future values of cash flows and apply financial mathematical techniques
• Apply spreadsheets to calculate and present data

Page 8

Syllabus:

1. Basic Mathematics

• Use of formulae, including negative powers as in the formulae for the learning curve
• Order of operations in formulae, including brackets, powers and roots
• Percentages and ratios
• Rounding of numbers
• Basic algebraic techniques and solution of equations, including simultaneous equations and quadratic equations
• Graphs of linear and quadratic equations
• Manipulation of inequalities

2. Probability

• Probability and its relationship with proportion and per cent

• Addition and multiplication rules of probability theory

• Venn diagrams

• Expected values and expected value tables

• Risk and uncertainty

3. Summarising and Analysing Data

• Data and information

• Tabulation of data

• Graphs, charts and diagrams: scatter diagrams, histograms, bar charts and ogives
• Summary measures of central tendency and dispersion for both grouped and ungrouped data

• Frequency distributions

• Normal distribution

• Pareto distribution and the “80:20” rule

• Index numbers

Page 9

4. Relationships between variables

• Scatter diagrams

• Correlation co-efficient: Spearman’s rank correlation coefficient and Pearson’s correlation coefficient

• Simple linear regression

5. Forecasting

• Time series analysis – graphical analysis

• Trends in time series – graphs, moving averages and linear regressions

• Seasonal variations using both additive and multiplicative models

• Forecasting and its limitations

6. Financial Mathematics

• Simple and compound interest

• Present value (including using formulae and tables)

• Annuities and perpetuities

• Loans and Mortgages

• Sinking funds and savings funds

• Discounting to find net present value (NPV) and internal rate of return (IRR)

• Interpretation of NPV and IRR

Page 10

7. Spreadsheets

• Features and functions of commonly used spreadsheet software: workbook, worksheet, rows, columns, cells, data, text, formulae, formatting, printing, graphs and macros
• Advantages and disadvantages of spreadsheet software, when compared to manual analysis and other types of software application packages
• Use of spreadsheet software in day-to-day work: budgeting, forecasting, reporting performance, variance analysis, what-if analysis, discounted cash flow calculations

Page 11

STUDY UNIT 1

Probability

Contents

Unit    Title    Page

A. Estimating Probabilities    13
   Introduction    13
   Theoretical Probabilities    14
   Empirical Probabilities    15

B. Types of Event    17

C. The Two Laws of Probability    19
   Addition Law for Mutually Exclusive Events    19
   Addition Law for a Complete List of Mutually Exclusive Events    20
   Addition Law for Non-Mutually-Exclusive Events
   Multiplication Law for Independent Events
   Distinguishing the Laws

D. Tree Diagrams    29
   Examples    29

E. Binomial Distribution    39

F. Poisson Distribution    41

G. Venn Diagrams    43


Page 13

A. ESTIMATING PROBABILITIES

Introduction

Suppose someone tells you “there is a 50-50 chance that we will be able to deliver your order on Friday”. This statement means something intuitively, even though when Friday arrives there are only two outcomes. Either the order will be delivered or it will not. Statements like this are trying to put probabilities or chances on uncertain events.

Probability is measured on a scale between 0 and 1. Any event which is impossible has a probability of 0, and any event which is certain to occur has a probability of 1. For example, the probability that the sun will not rise tomorrow is 0; the probability that a light bulb will fail sooner or later is 1. For uncertain events, the probability of occurrence is somewhere between 0 and 1. The 50-50 chance mentioned above is equivalent to a probability of 0.5.

Try to estimate probabilities for the following events. Remember that events which are more likely to occur than not have probabilities which are greater than 0.5, and the more certain they are the closer the probabilities are to 1. Similarly, events which are more likely not to occur have probabilities which are less than 0.5. The probabilities get closer to 0 as the events get more unlikely.

(a) The probability that a coin will fall heads when tossed.

(b) The probability that it will snow next Christmas.

(c) The probability that sales for your company will reach record levels next year.

(d) The probability that your car will not break down on your next journey.

(e) The probability that the throw of a dice will show a six.

Page 14

The probabilities are as follows:

(a) The probability of heads is 0.5.

(b) This probability is quite low. It is somewhere between 0 and 0.1.

(c) You can answer this one yourself.

(d) This depends on how frequently your car is serviced. For a reliable car it should be greater than 0.99.

(e) The probability of a six is 1/6 or 0.167.

Theoretical Probabilities

Sometimes probabilities can be specified by considering the physical aspects of the situation. For example, consider the tossing of a coin. What is the probability that it will fall heads? There are two sides to a coin. There is no reason to favour either side as a coin is symmetrical. Therefore the probability of heads, which we call P(H), is:

P(H) = 0.5.

Another example is throwing a dice. A dice has six sides. Again, assuming it is not weighted in favour of any of the sides, there is no reason to favour one side rather than another. Therefore the probability of a six showing uppermost, P(6), is:

P(6) = 1/6 = 0.167.

Page 15

As a third and final example, imagine a box containing 100 beads of which 23 are black and 77 white. If we pick one bead out of the box at random (blindfold and with the box well shaken up), what is the probability that we will draw a black bead? We have 23 chances out of 100, so the probability is:

P = 23/100 (or P = 0.23)

Probabilities of this kind, where we can assess them from our prior knowledge of the situation, are also called “a priori” probabilities.

In general terms, we can say that if an event E can happen in h ways out of a total of n possible equally likely ways, then the probability of that event occurring (called a success) is given by:

P(E) = h/n = (Number of possible ways of E occurring) / (Total number of possible outcomes)
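The formula P(E) = h/n is easy to check numerically. The following short Python sketch (illustrative only; programming is not part of this syllabus) computes the three a priori probabilities above using exact fractions:

```python
from fractions import Fraction

def theoretical_probability(h, n):
    # P(E) = (ways E can occur) / (total equally likely outcomes)
    return Fraction(h, n)

p_heads = theoretical_probability(1, 2)     # the coin
p_six = theoretical_probability(1, 6)       # the dice
p_black = theoretical_probability(23, 100)  # the bag of beads

print(p_heads, p_six, p_black)
```

Using `Fraction` rather than floating-point numbers keeps results such as 1/6 exact instead of rounded.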

Empirical Probabilities

Often it is not possible to give a theoretical probability of an event. For example, what is the probability that an item on a production line will fail a quality control test? This question can be answered either by measuring the probability in a test situation (i.e. empirically) or by relying on previous results. If 100 items are taken from the production line and tested, then:

Probability of failure, P(F) = (Number of items which fail) / (Total number of items tested)

So, if 5 items actually fail the test:

Page 16

P(F) = 5/100 = 0.05.

Sometimes it is not possible to set up an experiment to calculate an empirical probability. For example, what are your chances of passing a particular examination? You cannot sit a series of examinations to answer this. Previous results must be used. If you have taken 12 examinations in the past, and failed only one, you might estimate:

P(pass) = 11/12 = 0.92 (approximately).

Page 17

B. TYPES OF EVENT

There are five types of event:

• Mutually exclusive

• Non-mutually-exclusive

• Independent

• Dependent or non-independent

• Complementary.

(a) Mutually Exclusive Events

If two events are mutually exclusive then the occurrence of one event precludes the possibility of the other occurring. For example, the two sides of a coin are mutually exclusive since, on the throw of the coin, “heads” automatically rules out the possibility of “tails”. On the throw of a dice, a six excludes all other possibilities. In fact, all the sides of a dice are mutually exclusive; the occurrence of any one of them as the top face automatically excludes any of the others.

(b) Non-Mutually-Exclusive Events

These are events which can occur together. For example, in a pack of playing cards hearts and queens are non-mutually-exclusive since there is one card, the queen of hearts, which is both a heart and a queen and so satisfies both criteria for success.

(c) Independent Events

These are events which are not mutually exclusive and where the occurrence of one event does not affect the occurrence of the other. For example, the tossing of a coin in no way affects the result of the next toss of the coin; each toss has an independent outcome.

Page 18

(d) Dependent or Non-Independent Events

These are situations where the outcome of one event is dependent on another event. The probability of a car owner being able to drive to work in his car is dependent on him being able to start the car. The probability of him being able to drive to work given that the car starts is a conditional probability, written:

P(Drive to work | Car starts)

where the vertical line is a shorthand way of writing “given that”.

(e) Complementary Events

An event either occurs or it does not occur, i.e. we are certain that one or other of these situations holds.

For example, if we throw a dice and denote the event where a six is uppermost by A, and the event where either a one, two, three, four or five is uppermost by Ā (or not A), then A and Ā are complementary, i.e. they are mutually exclusive with a total probability of 1. Thus:

P(A) + P(Ā) = 1.

This relationship between complementary events is useful as it is often easier to find the probability of an event not occurring than to find the probability that it does occur. Using the above formula, we can always find P(A) by subtracting P(Ā) from 1.
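The complement rule can be sketched in a couple of lines of Python (an illustration using the same dice example, not part of the manual):

```python
def p_complement(p_not_a):
    # P(A) = 1 - P(A-bar) for complementary events A and A-bar
    return 1 - p_not_a

# P(six) found from P(not a six) = 5/6:
p_six = p_complement(5 / 6)
print(round(p_six, 3))  # 0.167
```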

Page 19

C. THE TWO LAWS OF PROBABILITY

Addition Law for Mutually Exclusive Events

Consider again the example of throwing a dice. You will remember that each face is equally likely, so P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.

What is the chance of getting 1, 2 or 3? From the symmetry of the dice you can see that P(1 or 2 or 3) = 0.5. But also, from the equations shown above, you can see that

P(1) + P(2) + P(3) = 1/6 + 1/6 + 1/6 = 0.5.

This illustrates that

Page 20

P(1 or 2 or 3) = P(1) + P(2) + P(3)

This result is a general one and it is called the addition law of probabilities for mutually exclusive events. It is used to calculate the probability of one of any group of mutually exclusive events. It is stated more generally as:

P(A or B or ... or N) = P(A) + P(B) + ... + P(N)

where A, B, ..., N are mutually exclusive events.

Addition Law for a Complete List of Mutually Exclusive Events

If all possible mutually exclusive events are listed, then it is certain that one of these outcomes will occur. For example, when the dice is tossed there must be one number showing afterwards:

P(1 or 2 or 3 or 4 or 5 or 6) = 1.

Using the addition law for mutually exclusive events, this can also be stated as

P(1) + P(2) + P(3) + P(4) + P(5) + P(6) = 1.

Again this is a general rule. The sum of the probabilities of a complete list of mutually exclusive events will always be 1.

Page 21

Example

An urn contains 100 coloured balls. Five of these are red, seven are blue and the rest are white. One ball is to be drawn at random from the urn.

What is the probability that it will be red?

P(R) = 5/100 = 0.05.

What is the probability that it will be red or blue?

P(R or B) = P(R) + P(B) = 0.05 + 0.07 = 0.12.

This result uses the addition law for mutually exclusive events since a ball cannot be both blue and red.

What is the probability that it will be white?

The ball must be either red or blue or white. This is a complete list of mutually exclusive possibilities.

Therefore P(R) + P(B) + P(W) = 1

P(W) = 1 – P(R) – P(B)
     = 1 – 0.05 – 0.07
     = 0.88
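Both laws used in this example can be verified with a few lines of Python (an illustrative sketch):

```python
# Urn example: 5 red, 7 blue and 88 white balls out of 100.
p_red, p_blue = 5 / 100, 7 / 100

# Addition law for mutually exclusive events:
p_red_or_blue = p_red + p_blue

# A complete list of mutually exclusive events sums to 1:
p_white = 1 - p_red - p_blue

print(round(p_red_or_blue, 2), round(p_white, 2))  # 0.12 0.88
```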

Page 22

Addition Law for Non-Mutually-Exclusive Events

Events which are non-mutually-exclusive are, by definition, capable of occurring together. The addition law can still be used but the probability of the events occurring together must be deducted:

P(A or B or both) = P(A) + P(B) – P(A and B).

Examples

(a) If one card is drawn from a pack of 52 playing cards, what is the probability: (i) that it is either a spade or an ace; (ii) that it is either a spade or the ace of diamonds?

(i) Let event B be “the card is a spade”. Let event A be “the card is an ace”.

We require P(spade or ace [or both]) = P(A or B)
= P(A) + P(B) – P(A and B)
= 4/52 + 13/52 – 1/52
= 16/52 = 4/13.

(ii) A single card cannot be both a spade and the ace of diamonds, so these events are mutually exclusive:

P(spade or ace of diamonds) = 13/52 + 1/52 = 14/52 = 7/26.

Page 23

(b) At a local shop 50% of customers buy unwrapped bread and 60% buy wrapped bread. What proportion of customers buy at least one kind of bread if 20% buy both wrapped and unwrapped bread?

Let S represent all the customers.
Let T represent those customers buying unwrapped bread.
Let W represent those customers buying wrapped bread.

P(buy at least one kind of bread) = P(buy wrapped or unwrapped or both)
= P(T or W)
= P(T) + P(W) – P(T and W)
= 0.5 + 0.6 – 0.2
= 0.9

Page 24

So, 9/10 of the customers buy at least one kind of bread.
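The bread example is a direct application of the formula; a minimal Python sketch (illustrative only):

```python
def p_a_or_b(p_a, p_b, p_a_and_b):
    # Addition law for non-mutually-exclusive events:
    # P(A or B or both) = P(A) + P(B) - P(A and B)
    return p_a + p_b - p_a_and_b

# 50% buy unwrapped, 60% buy wrapped, 20% buy both.
print(round(p_a_or_b(0.5, 0.6, 0.2), 2))  # 0.9
```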

Multiplication Law for Independent Events

Consider an item on a production line. This item could be defective or acceptable. These two possibilities are mutually exclusive and represent a complete list of alternatives. Assume that:

Probability that it is defective, P(D) = 0.2
Probability that it is acceptable, P(A) = 0.8.

Now consider another facet of these items. There is a system for checking them, but only every tenth item is checked. This is shown as:

Probability that it is checked, P(C) = 0.1
Probability that it is not checked, P(N) = 0.9.

Again these two possibilities are mutually exclusive and they represent a complete list of alternatives. An item is either checked or it is not.

Consider the possibility that an individual item is both defective and not checked. These two events can obviously both occur together so they are not mutually exclusive. They are, however, independent. That is to say, whether an item is defective or acceptable does not affect the probability of it being tested.

There are also other kinds of independent events. If you toss a coin once and then again a second time, the outcome of the second test is independent of the result of the first one.

Page 25

The results of any third or subsequent test are also independent of any previous results. The probability of heads on any test is 0.5 even if all the previous tests have resulted in heads.

To work out the probability of two independent events both happening, you use the multiplication law. This can be stated as:

P(A and B) = P(A) × P(B) if A and B are independent events.

Again this result is true for any number of independent events, so:

P(A and B and ... and N) = P(A) × P(B) × ... × P(N).

Consider the example above. For any item:

Probability that it is defective, P(D) = 0.2
Probability that it is acceptable, P(A) = 0.8
Probability that it is checked, P(C) = 0.1
Probability that it is not checked, P(N) = 0.9.

Using the multiplication law to calculate the probability that an item is both defective and not checked:

P(D and N) = 0.2 × 0.9 = 0.18.

The probabilities of the other combinations of independent events can also be calculated:

P(D and C) = 0.2 × 0.1 = 0.02
P(A and N) = 0.8 × 0.9 = 0.72
P(A and C) = 0.8 × 0.1 = 0.08.
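The four joint probabilities can be generated mechanically. A short Python sketch of the production-line example (illustrative only):

```python
p_quality = {"D": 0.2, "A": 0.8}   # defective / acceptable
p_check = {"C": 0.1, "N": 0.9}     # checked / not checked

# Multiplication law: probabilities of independent events multiply.
joint = {(q, c): p_quality[q] * p_check[c]
         for q in p_quality for c in p_check}

print(round(joint[("D", "N")], 2))     # 0.18, as in the text
print(round(sum(joint.values()), 10))  # the complete list sums to 1
```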

Page 26

Examples

a) A machine produces two batches of items. The first batch contains 1,000 items of which 20 are damaged. The second batch contains 10,000 items of which 50 are damaged. If one item is taken from each batch, what is the probability that both items are defective?

P(D1) = 20/1,000 = 0.02 and P(D2) = 50/10,000 = 0.005.

Since these two probabilities are independent:

P(D1 and D2) = P(D1) × P(D2) = 0.02 × 0.005 = 0.0001.

b) A card is drawn at random from a well shuffled pack of playing cards. What is the probability that the card is a heart? What is the probability that the card is a three? What is the probability that the card is the three of hearts?

P(heart) = 13/52 = 1/4. P(three) = 4/52 = 1/13. Since the suit and the value of a card are independent, P(three of hearts) = 1/4 × 1/13 = 1/52.

Page 27

c) A dice is thrown three times. What is the probability of one or more sixes in these three throws?

Using complementary events: P(no six in three throws) = (5/6)^3 = 125/216, so P(one or more sixes) = 1 – 125/216 = 91/216 = 0.42 (approximately).

Distinguishing the Laws

Although the above laws of probability are not complicated, you must think carefully and clearly when using them. Remember that events must be mutually exclusive before you can use the addition law, and they must be independent before you can use the multiplication law.

Another matter about which you must be careful is the listing of equally likely outcomes. Be sure that you list all of them. For example, we can list the possible results of tossing two coins, namely:

First Coin    Second Coin
Heads         Heads
Tails         Heads
Heads         Tails
Tails         Tails

There are four equally likely outcomes. Do not make the mistake of saying, for example, that there are only two outcomes (both heads or not both heads); you must list all the possible outcomes. (In this case “not both heads” can result in three different ways, so the probability of this result will be higher than “both heads”.)

Page 28

In this example, the probability that there will be one heads and one tails (heads-tails or tails-heads) is 0.5. This is a case of the addition law at work: the probability of heads-tails (1/4) plus the probability of tails-heads (1/4). Putting it another way, the probability of different faces is equal to the probability of the same faces; in both cases the probability is 1/2.
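Listing every equally likely outcome, as the text insists, can be automated. A Python sketch of the two-coin case (illustrative only):

```python
from itertools import product

# All equally likely outcomes of tossing two coins.
outcomes = list(product("HT", repeat=2))
print(outcomes)   # 4 outcomes: HH, HT, TH, TT

# "One head and one tail" occurs on 2 of the 4 outcomes.
p_mixed = sum(1 for a, b in outcomes if a != b) / len(outcomes)
print(p_mixed)    # 0.5
```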

Page 29

D. TREE DIAGRAMS

A compound experiment, i.e. one with more than one component part, may be regarded as a sequence of similar experiments. For example, the rolling of two dice can be considered as the rolling of one followed by the rolling of the other; and the tossing of four coins can be thought of as tossing one after the other. A tree diagram enables us to construct an exhaustive list of mutually exclusive outcomes of a compound experiment. Furthermore, a tree diagram gives us a pictorial representation of probability.

By exhaustive, we mean that every possible outcome is considered.

By mutually exclusive we mean, as before, that if one of the outcomes of the compound experiment occurs then the others cannot.

Examples

a) The concept can be illustrated using the example of a bag containing five red and three white billiard balls. If two are selected at random without replacement, what is the probability that one of each colour is drawn?

We can represent this as a tree diagram as in Figure 1.1.

N.B. R indicates red ball; W indicates white ball. Probabilities at each stage are shown alongside the branches of the tree.

Figure 1.1

Page 30

Table 1.1

Outcome    Probability
RR         5/8 × 4/7 = 20/56
RW         5/8 × 3/7 = 15/56
WR         3/8 × 5/7 = 15/56
WW         3/8 × 2/7 = 6/56
Total      1

Page 31

We work from left to right in the tree diagram. At the start we take a ball from the bag. This ball is either red or white, so we draw two branches labelled R and W, corresponding to the two possibilities. We then also write on the branch the probability of the outcome of this simple experiment being along that branch.

We then consider drawing a second ball from the bag. Whether we draw a red or a white ball the first time, we can still draw a red or a white ball the second time, so we mark in the two possibilities at the end of each of the two branches of our existing tree diagram. We can then see that there are four different mutually exclusive outcomes possible, namely RR, RW, WR and WW. We enter on these second branches the conditional probabilities associated with them.

Thus, on the uppermost branch in the diagram we must insert the probability of obtaining a second red ball given that the first was red. This probability is 4/7 as there are only seven balls left in the bag, of which four are red. Similarly for the other branches.

Each complete branch from start to tip represents one possible outcome of the compound experiment and each of the branches is mutually exclusive. To obtain the probability of a particular outcome of the compound experiment occurring, we multiply the probabilities along the different sections of the branch, using the general multiplication law for probabilities.

We thus obtain the probabilities shown in Table 1.1. The sum of the probabilities should add up to 1, as we know one or other of these mutually exclusive outcomes is certain to happen.
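The tree in Figure 1.1 can be reproduced numerically. The sketch below (illustrative; exact fractions avoid rounding) multiplies along each branch and confirms the entries of Table 1.1:

```python
from fractions import Fraction

def two_draw_branches(red, white):
    # Probability along each branch when drawing two balls without
    # replacement: first-draw probability times the conditional
    # second-draw probability.
    n = red + white
    return {
        "RR": Fraction(red, n) * Fraction(red - 1, n - 1),
        "RW": Fraction(red, n) * Fraction(white, n - 1),
        "WR": Fraction(white, n) * Fraction(red, n - 1),
        "WW": Fraction(white, n) * Fraction(white - 1, n - 1),
    }

branches = two_draw_branches(5, 3)
print(branches["RW"] + branches["WR"])  # one of each colour: 15/28
print(sum(branches.values()))           # all branches sum to 1
```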

Page 32

b) A bag contains three red balls, two white balls and one blue ball. Two balls are drawn at random (without replacement). Find the probability that:

i. Both white balls are drawn.
ii. The blue ball is not drawn.
iii. A red then a white are drawn.
iv. A red and a white are drawn.

To solve this problem, let us build up a tree diagram.

Figure 1.2

The first ball drawn has a subscript of 1, e.g. red first = R1. The second ball drawn has a subscript of 2.

Page 33

Note there is only one blue ball in the bag, so if we picked a blue ball first then we can have only a red or a white second ball. Also, whatever colour is chosen first, there are only five balls left for the second draw as we do not have replacement.

Figure 1.3

Page 34

We can now list all the possible outcomes, with their associated probabilities:

Table 1.2

It is possible to read off the probabilities we require from Table 1.2.

(i) Probability that both white balls are drawn:

P(W1 and W2) = 2/6 × 1/5 = 2/30 = 1/15

Page 35

(ii) Probability that the blue ball is not drawn:

P(no blue) = 5/6 × 4/5 = 20/30 = 2/3

(iii) Probability that a red then a white are drawn:

P(R1 and W2) = 3/6 × 2/5 = 6/30 = 1/5

(iv) Probability that a red and a white are drawn (in either order):

P(R1 and W2) + P(W1 and R2) = 6/30 + 6/30 = 12/30 = 2/5

c) A couple go on having children, to a maximum of four, until they have a son. Draw a tree diagram to find the possible family sizes and calculate the probability that they have a son.

We assume that any one child is equally likely to be a boy or a girl, i.e. P(B) = P(G) = 1/2. Note that once they have produced a son, they do not have any more children. The tree diagram will be as in Figure 1.4.

Page 36

Figure 1.4

Table 1.3

Possible Families    Probability
1 Boy                1/2
1 Girl, 1 Boy        (1/2)^2 = 1/4
2 Girls, 1 Boy       (1/2)^3 = 1/8
3 Girls, 1 Boy       (1/2)^4 = 1/16
4 Girls              (1/2)^4 = 1/16
Total                1

Page 37

The probability that they have a son is therefore:

P(son) = 1/2 + 1/4 + 1/8 + 1/16 = 15/16.


Page 39

E. BINOMIAL DISTRIBUTION

The binomial distribution can be used to describe the likely outcome of events for discrete variables which:

(a) Have only two possible outcomes; and
(b) Are independent.

Suppose we are conducting a questionnaire. The Binomial distribution might be used to analyse the results if the only two responses to a question are ‘yes’ or ‘no’, and if the response to one question does not influence the likely response to any other question.

Put rather more formally, the Binomial distribution occurs when there are n independent trials (or tests) with the probability of ‘success’ or ‘failure’ in each trial (or test) being constant.

Let p = the probability of ‘success’
Let q = the probability of ‘failure’
Then q = 1 – p

For example, if we toss an unbiased coin ten times, we might wish to find the probability of getting four heads. Here n = 10, p (head) = 0.5 and q (tail) = 1 – p = 0.5.

The probability of obtaining r ‘successes’ in n trials (tests) is given by the following formula:

P(r) = nCr × p^r × q^(n-r)

where nCr = n! / ((n – r)! × r!) is the number of combinations of r successes in n trials.

The probability of getting exactly four heads out of ten tosses of an unbiased coin can therefore be solved as:

P(4) = 10C4 × (0.5)^4 × (0.5)^6

now

10C4 = 10! / ((10 – 4)! × 4!) = (10 × 9 × 8 × 7) / (4 × 3 × 2 × 1) = 210

so P(4) = 210 × (0.5)^4 × (0.5)^6

Page 40

P(4) = 210 × 0.0625 × 0.015625

P(4) = 0.2051

In other words the probability of getting exactly four heads out of ten tosses of an unbiased coin is 0.2051 or 20.51%.
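The worked example can be checked with Python’s standard library (an illustrative sketch; `math.comb` supplies nCr):

```python
from math import comb

def binomial_p(n, r, p):
    # P(r) = nCr * p**r * q**(n - r), with q = 1 - p
    return comb(n, r) * p**r * (1 - p)**(n - r)

print(comb(10, 4))                       # 210 combinations
print(round(binomial_p(10, 4, 0.5), 4))  # 0.2051
```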

It may be useful to state the formulae for finding all the possible probabilities of obtaining r successes in n trials.

Where P(r) = nCr × p^r × q^(n-r)

and r = 0, 1, 2, 3, …, n

then, from our knowledge of combinations:

P(0) = q^n
P(1) = n × p × q^(n-1)
P(2) = [n(n-1) / (2 × 1)] × p^2 × q^(n-2)
P(3) = [n(n-1)(n-2) / (3 × 2 × 1)] × p^3 × q^(n-3)
P(4) = [n(n-1)(n-2)(n-3) / (4 × 3 × 2 × 1)] × p^4 × q^(n-4)
…
P(n-2) = [n(n-1) / (2 × 1)] × p^(n-2) × q^2
P(n-1) = n × p^(n-1) × q
P(n) = p^n

Page 41

F. POISSON DISTRIBUTION

Introduction

The Poisson distribution may be regarded as a special case of the binomial distribution. As with the Binomial distribution, the Poisson distribution can be used where there are only two possible outcomes:

1. Success (p)
2. Failure (q)

and these events are independent. The Poisson distribution is usually used where n is very large but p is very small, and where the mean np is constant and typically < 5. As p is very small (p < 0.1 and often much less), the chance of the event occurring is extremely low. The Poisson distribution is therefore typically used for unlikely events such as accidents, strikes etc.

The Poisson distribution is also used to solve problems where events tend to occur at random, such as incoming phone calls, passenger arrivals at a terminal etc.

Whereas the formula for solving Binomial problems uses the probabilities for both “success” (p) and “failure” (q), the formula for solving Poisson problems only uses the probability of “success” (p).

If µ is the mean, it is possible to show that the probability of r successes is given by the

formula:

P(r) = e^(−µ) µ^r / r!

where e = exponential constant = 2.7183
µ = mean number of successes = np
n = number of trials
p = probability of “success”
r = number of successes

If we substitute r = 0, 1, 2, 3, 4, 5 … in this formula we obtain the following expressions:

P(0) = e^(−µ)
P(1) = µe^(−µ)
P(2) = µ^2 e^(−µ)/(2 × 1)
P(3) = µ^3 e^(−µ)/(3 × 2 × 1)
P(4) = µ^4 e^(−µ)/(4 × 3 × 2 × 1)
P(5) = µ^5 e^(−µ)/(5 × 4 × 3 × 2 × 1)

In questions you are either given the mean µ or you have to find µ from the information

given, which is usually data for n and p; µ is then obtained from the relationship µ = np.
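For instance (assumed figures: n = 500 trials with p = 0.004, so µ = np = 2), a short Python sketch evaluates the Poisson formula and shows how closely it tracks the exact Binomial values when n is large and p is small:

```python
import math

n, p = 500, 0.004          # assumed values: n large, p small
mu = n * p                 # mean number of successes, µ = np = 2

def poisson_prob(r, mu):
    """P(r) = e^(-µ) µ^r / r!"""
    return math.exp(-mu) * mu**r / math.factorial(r)

def binomial_prob(r, n, p):
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

# Compare the Poisson approximation with the exact Binomial probabilities
for r in range(4):
    print(r, round(poisson_prob(r, mu), 4), round(binomial_prob(r, n, p), 4))
```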

You have to be able to work out e raised to a negative power. e^(−3) is the same as 1/e^3, so you can simply work it out as 1/2.7183^3.

Alternatively, many calculators have a key marked e^x. The easiest way to find e^(−3) on your calculator is to enter 3, press the +/− key, press the e^x key, and you should obtain 0.049787. If your calculator does not have an e^x key but has an x^y key, enter 2.7183, press the x^y key, enter 3, press the +/− key, then press the = key; you should obtain 0.049786.
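Both routes can be checked in Python; the tiny difference in the last digit comes from rounding e to 2.7183:

```python
import math

print(round(math.exp(-3), 6))    # 0.049787 (using e itself)
print(round(1 / 2.7183**3, 6))   # 0.049786 (using e rounded to 2.7183)
```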

Page 43

G. VENN DIAGRAMS

Definition:

Venn diagrams or set diagrams are diagrams that show all possible logical relations between

a finite collection of sets.

A Venn diagram is constructed with a collection of simple closed curves drawn in a plane.

The principle of these diagrams is that classes (or sets) “be represented by regions in such relation to one another that all the possible logical relations of these classes can be indicated in the same diagram. That is, the diagram initially leaves room for any possible relation of the classes, and the actual or given relation can then be specified by indicating that some particular region is or is not null”.

Venn diagrams normally comprise overlapping circles. The interior of the circle

symbolically represents the elements of the set, while the exterior represents elements that are

not members of the set.

For example: in a two-set Venn diagram, one circle may represent the group of all wooden

objects, while another circle may represent the set of all tables. The overlapping area or

intersection would then represent the set of all wooden tables.
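The same example can be written directly with Python's built-in `set` type (the objects listed are purely illustrative):

```python
# Hypothetical sets of objects
wooden_objects = {"chair", "table", "spoon", "fence"}
tables = {"table", "desk", "glass table"}

wooden_tables = wooden_objects & tables   # intersection: members of both sets
print(wooden_tables)                      # {'table'}

non_wooden_tables = tables - wooden_objects   # tables outside the 'wooden' set
print(non_wooden_tables)
```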

Venn Diagram that shows the intersections of the Greek, Latin and Russian alphabets (upper

case letters)


Page 45

STUDY UNIT 2

Collection of Data

Contents

Unit  Title                                                      Page

A.    Collection of Data - Preliminary Considerations              47
      Exact Definition of the Problem                              47
      Definition of the Units                                      47
      Scope of the Enquiry                                         47
      Accuracy of the Data                                         48
B.    Types of Data                                                49
      Primary and Secondary Data                                   49
      Quantitative/Qualitative Categorisation                      49
      Continuous/Discrete Categorisation                           49
C.    Requirements of Statistical Data                             51
      Homogeneity                                                  51
      Completeness                                                 51
      Accurate Definition                                          52
      Uniformity                                                   52
D.    Methods of Collecting Data                                   53
      Published Statistics                                         53
      Personal Investigation/Interview                             54
      Delegated Personal Investigation/Interview                   54
      Questionnaire                                                54
E.    Interviewing                                                 57
      Advantages of Interviewing                                   57
      Disadvantages of Interviewing                                57
F.    Designing the Questionnaire                                  59
      Principles                                                   59
      An Example                                                   60
G.    Choice of Method                                             65
H.    Pareto Distribution and the “80:20” Rule                     67


Page 47

A. COLLECTION OF DATA - PRELIMINARY CONSIDERATIONS

Even before the collection of data starts, there are some important points to consider when

planning a statistical investigation. Shortly I will give you a list of these together with a few

notes on each; some of them you may think obvious or trivial, but do not neglect to learn

them because they are very often the points which are overlooked. Furthermore, examiners

like to have lists as complete as possible when they ask for them!

What, then, are these preliminary matters?

Exact Definition of the Problem

This is necessary in order to ensure that nothing important is omitted from the enquiry, and

that effort is not wasted by collecting irrelevant data. The problem as originally put to the

statistician is often of a very general type and it needs to be specified precisely before work

can begin.

Definition of the Units

The results must appear in comparable units for any analysis to be valid. If the analysis is

going to involve comparisons, then the data must all be in the same units. It is no use just

asking for “output” from several factories - some may give their answers in numbers of items,

some in weight of items, some in number of inspected batches and so on.

Scope of the Enquiry

No investigation should get under way without first defining the field to be covered. Are we

interested in all departments of our business, or only some? Are we to concern ourselves with

our own business only, or with others of the same kind?

Page 48

Accuracy of the Data

To what degree of accuracy is data to be recorded? For example, are ages of individuals to be

given to the nearest year or to the nearest month or as the number of completed years? If

some of the data is to come from measurements, then the accuracy of the measuring

instrument will determine the accuracy of the results. The degree of precision required in an

estimate might affect the amount of data we need to collect. In general, the more precisely we

wish to estimate a value, the more readings we need to take.

Page 49

B. TYPES OF DATA

Primary and Secondary Data

In its strictest sense, primary data is data which is both original and has been obtained in

order to solve the specific problem in hand. Primary data is therefore raw data and has to be

classified and processed using appropriate statistical methods in order to reach a solution to

the problem.

Secondary data is any data other than primary data. Thus it includes any data which has been

subject to the processes of classification or tabulation or which has resulted from the

application of statistical methods to primary data, and all published statistics.

Quantitative/Qualitative Categorisation

Variables may be either quantitative or qualitative. Quantitative variables, to which we shall

restrict discussion here, are those for which observations are numerical in nature. Qualitative

variables have non-numeric observations, such as colour of hair, although, of course, each

possible non-numeric value may be associated with a numeric frequency.

Continuous/Discrete Categorisation

Variables may be either continuous or discrete. A continuous variable may take any value

between two stated limits (which may possibly be minus and plus infinity). Height, for

example, is a continuous variable, because a person’s height may (with appropriately accurate

equipment) be measured to any minute fraction of a millimetre. A discrete variable, however,

can take only certain values occurring at intervals between stated limits. For most (but not all)

discrete variables, these interval values are the set of integers (whole numbers).

For example, if the variable is the number of children per family, then the only possible

values are 0, 1, 2, ... etc. because it is impossible to have other than a whole number of

children. However, in Ireland, shoe sizes are stated in half-units, and so here we have an

example of a discrete variable which can take the values 1, 11/2, 2, 21/2, etc.


Page 51

C. REQUIREMENTS OF STATISTICAL DATA

Having decided upon the preliminary matters about the investigation, the statistician must

look in more detail at the actual data to be collected. The desirable qualities of statistical data

are the following:

– Homogeneity

– Completeness

– Accurate definition

– Uniformity.

Homogeneity

The data must be in properly comparable units. “Five houses” means little since five

dwelling houses are very different from five ancestral castles. Houses cannot be compared

unless they are of a similar size or value. If the data is found not to be homogeneous, there

are two methods of adjustment possible.

a) Break down the group into smaller component groups which are homogeneous and

study them separately.

b) Standardise the data. Use units such as “output per man-hour” to compare the output

of two factories of very different size. Alternatively, determine a relationship between

the different units so that all may be expressed in terms of one; in food consumption

surveys, for example, a child may be considered equal to half an adult.

Completeness

Great care must be taken to ensure that no important aspect is omitted from the enquiry.

Page 52

Accurate Definition

Each term used in an investigation must be carefully defined; it is so easy to be slack about

this and to run into trouble. For example, the term “accident” may mean quite different things

to the injured party, the police and the insurance company! Watch out also, when using other

people’s statistics, for changes in definition. Laws may, for example, alter the definition of an

“indictable offence” or of an “unemployed person”.

Uniformity

The circumstances of the data must remain the same throughout the whole investigation. It is

no use, for example, comparing the average age of workers in an industry at two different

times if the age structure has changed markedly. Likewise, it is not much use comparing a

firm’s profits at two different times if the working capital has changed.

Page 53

D. METHODS OF COLLECTING DATA

When all the foregoing matters have been dealt with, we come to the question of how to

collect the data we require. The methods usually available are as follows:

– Use of published statistics

– Personal investigation/interview

– Delegated personal investigation/interview

– Questionnaire.

Published Statistics

Sometimes we may be attempting to solve a problem that does not require us to collect new

information, but only to reassemble and reanalyse data which has already been collected by

someone else for some other purpose.

We can often make good use of the great amount of statistical data published by

governments, the United Nations, nationalised industries, chambers of trade and commerce

and so on. When using this method, it is particularly important to be clear on the definition of

terms and units and on the accuracy of the data. The source must be reliable and the

information up-to-date.

This type of data is sometimes referred to as secondary data in that the investigator himself

has not been responsible for collecting it and it thus came to him “second-hand”. By contrast,

data which has been collected by the investigator for the particular survey in hand is called

primary data.

The information you require may not be found in one source but parts may appear in several

different sources. Although the search through these may be time-consuming, it can lead to

data being obtained relatively cheaply and this is one of the advantages of this type of data

collection. Of course, the disadvantage is that you could spend a considerable amount of time

looking for information which may not be available.

Another disadvantage of using data from published sources is that the definitions used for

variables and units may not be the same as those you wish to use. It is sometimes difficult to

establish the definitions from published information, but, before using the data, you must

establish what it represents.

Page 54

Personal Investigation/Interview

In this method the investigator collects the data himself. The field he can cover is, naturally,

limited. The method has the advantage that the data will be collected in a uniform manner

and with the subsequent analysis in mind. There is sometimes a danger to be guarded against

though, namely that the investigator may be tempted to select data that accords with some of

his preconceived notions.

The personal investigation method is also useful if a pilot survey is carried out prior to the

main survey, as personal investigation will reveal the problems that are likely to occur.

Delegated Personal Investigation/Interview

When the field to be covered is extensive, the task of collecting information may be too great

for one person. Then a team of selected and trained investigators or interviewers may be

used. The people employed should be properly trained and informed of the purposes of the

investigation; their instructions must be very carefully prepared to ensure that the results are

in accordance with the “requirements” described in the previous section of this study unit. If

there are many investigators, personal biases may tend to cancel out.

Care in allocating the duties to the investigators can reduce the risks of bias. For example, if

you are investigating the public attitude to a new drug in two towns, do not put investigator A

to explore town X and investigator B to explore town Y, because any difference that is

revealed might be due to the towns being different, or it might be due to different personal

biases on the part of the two investigators. In such a case, you would try to get both people to

do part of each town.

Questionnaire

In some enquiries the data consists of information which must be supplied by a large number

of people. Then a very convenient way to collect the data is to issue questionnaire forms to

the people concerned and ask them to fill in the answers to a set of printed questions. This

method is usually cheaper than delegated personal investigation and can cover a wider field.

A carefully thought-out questionnaire is often also used in the previous methods of

investigation in order to reduce the effect of personal bias.

Page 55

The distribution and collection of questionnaires by post suffers from two main drawbacks:

a) The forms are completed by people who may be unaware of some of the requirements

and who may place different interpretations on the questions - even the most carefully

worded ones!

b) There may be a large number of forms not returned, and these may be mainly by

people who are not interested in the subject or who are hostile to the enquiry. The

result is that we end up with completed forms only from a certain kind of person and

thus have a biased sample.

It is essential to include a reply-paid envelope to encourage people to respond.

If the forms are distributed and collected by interviewers, a greater response is likely and

queries can be answered. This is the method used, for example, in the Population Census.

Care must be taken, however, that the interviewers do not lead respondents in any way.


Page 57

E. INTERVIEWING

Advantages of Interviewing

There are many advantages of using interviewers in order to collect information.

The major one is that a large amount of data can be collected relatively quickly and cheaply.

If you have selected the respondents properly and trained the interviewers thoroughly, then

there should be few problems with the collection of the data.

This method has the added advantage of being very versatile since a good interviewer can

adapt the interview to the needs of the respondent. Similarly, if the answers given to the

questions are not clear, then the interviewer can ask the respondent to elaborate on them.

When this is necessary, the interviewer must be very careful not to lead the respondent into

altering rather than clarifying the original answers. The technique for dealing with this

problem must be tackled at the training stage.

This “face-to-face” technique will usually produce a high response rate. The response rate is

determined by the proportion of interviews that are successful.

Another advantage of this method of collecting data is that with a well-designed

questionnaire it is possible to ask a large number of short questions of the respondent in one

interview. This naturally means that the cost per question is lower than in any other method.

Disadvantages of Interviewing

Probably the biggest disadvantage of this method of collecting data is that the use of a large

number of interviewers leads to a loss of direct control by the planners of the survey.

Mistakes in selecting interviewers and any inadequacy of the training programme may not be

recognised until the interpretative stage of the survey is reached. This highlights the need to

train interviewers correctly. It is particularly important to ensure that all interviewers ask

questions in a similar manner. Even with the best will in the world, it is possible that an

inexperienced interviewer, just by changing the tone of his or her voice, may give a different

emphasis to a question than was originally intended.

In spite of these difficulties, this method of data collection is widely used as questions can be

answered cheaply and quickly and, given the correct approach, the technique can achieve

high response rates.


Page 59

F. DESIGNING THE QUESTIONNAIRE

Principles

A "questionnaire" can be defined as "a formulated series of questions, an interrogatory" and

this is precisely what it is. For a statistical enquiry, the questionnaire consists of a sheet (or

possibly sheets) of paper on which there is a list of questions the answers to which will form

the data to be analysed. When we talk about the "questionnaire method" of collecting data,

we usually have in mind that the questionnaires are sent out by post or are delivered at

people’s homes or offices and left for them to complete. In fact, however, the method is very

often used as a tool in the personal investigation methods already described.

The principles to be observed when designing a questionnaire are as follows:

a) Keep it as short as possible, consistent with getting the right results.

b) Explain the purpose of the investigation so as to encourage people to give the

answers.

c) Individual questions should be as short and simple as possible.

d) If possible, only short and definite answers like "Yes", "No", or a number of some

sort should be called for.

e) Questions should be capable of only one interpretation.

f) There should be a clear logic in the order in which the questions are asked.

g) There should be no leading questions which suggest the preferred answer.

h) The layout should allow easy transfer for computer input.

i) Where possible, use the "alternative answer" system in which the respondent has to

choose between several specified answers.

j) The respondent should be assured that the answers will be treated confidentially and

that the truth will not be used to his or her detriment.

k) No calculations should be required of the respondent.

The above principles should always be applied when designing a questionnaire and, in

addition, you should understand them well enough to be able to remember them all if you are

asked for them in an examination question. They are principles and not rigid rules - often one

has to go against some of them in order to get the right information. Governments can often

ignore these principles because they can make the completion of the questionnaire

compulsory by law, but other investigators must follow the rules as far as practicable in order

Page 60

to make the questionnaire as easy to complete as possible - otherwise they will receive no

replies.

An Example

An actual example of a self-completion questionnaire (Figures 2.1 to 2.4) is now shown, as used by an educational establishment in a research survey. Note that, as the questionnaire is incorporated in this booklet, it does not give a true format. In practice, the questionnaire was not spread over so many pages.

Figure 2.1

Page 61

Figure 2.2

Page 62

Figure 2.3

Page 63

Figure 2.4


Page 65

G. CHOICE OF METHOD

Choice is difficult between the various methods, as the type of information required will

often determine the method of collection. If the data is easily obtained by automatic methods

or can be observed by the human eye without a great deal of trouble, then the choice is easy.

The problem comes when it is necessary to obtain information by questioning respondents.

The best guide is to ask yourself whether the information you want requires an attitude or

opinion or whether it can be acquired from short yes/no type or similar simple answers. If it is

the former, then it is best to use an interviewer to get the information; if the latter type of data

is required, then a postal questionnaire would be more useful.

Do not forget to check published sources first to see if the information can be found from

data collected for another survey.

Another yardstick worth using is time. If the data must be collected quickly, then use an

interviewer and a short simple questionnaire. However, if time is less important than cost,

then use a postal questionnaire, since this method may take a long time to collect relatively

limited data, but is cheap.

Sometimes a question in the examination paper is devoted to this subject. The tendency is for

the question to state the type of information required and ask you to describe the appropriate

method of data collection giving reasons for your choice.

More commonly, specific definitions and explanations of various terms, such as interviewer

bias, are contained in multi-part questions.


Page 67

H. PARETO DISTRIBUTION AND THE “80:20” RULE

The Pareto distribution, named after the Italian economist Vilfredo Pareto, is a power law

probability distribution that coincides with social, scientific, geophysical, actuarial, and many

other types of observable phenomena.

Probability density function

Figure: Pareto Type I probability density functions for various α (labeled “k”) with xm = 1. The horizontal axis is the x parameter. As α → ∞ the distribution approaches δ(x − xm), where δ is the Dirac delta function.

Cumulative distribution function

Figure: Pareto Type I cumulative distribution functions for various α (labeled “k”) with xm = 1. The horizontal axis is the x parameter.
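For reference, the standard Pareto Type I cumulative distribution function plotted above is F(x) = 1 − (xm/x)^α for x ≥ xm (and 0 below xm). A minimal sketch:

```python
def pareto_cdf(x, alpha, xm=1.0):
    # Pareto Type I CDF: F(x) = 1 - (xm/x)^alpha for x >= xm, else 0
    if x < xm:
        return 0.0
    return 1 - (xm / x) ** alpha

print(pareto_cdf(1.0, 2))   # 0.0 at the lower bound xm
print(pareto_cdf(2.0, 2))   # 0.75
```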

Page 68

The “80:20 law”, according to which 20% of all people receive 80% of all income, and 20% of the most affluent 20% receive 80% of that 80%, and so on, holds precisely when the Pareto index is α = log4(5) = log(5)/log(4), approximately 1.161.
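The quoted index can be verified with Python's two-argument `math.log`:

```python
import math

alpha = math.log(5, 4)   # log to base 4 of 5, i.e. log(5)/log(4)
print(round(alpha, 3))   # 1.161
```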

Project managers know that 20% of the work consumes 80% of their time and resources.

You can apply the 80/20 rule to almost anything, from the science of management to the physical world.

80% of your sales will come from 20% of your sales staff. 20% of your staff will cause 80%

of your problems, but another 20% of your staff will provide 80% of your production. It

works both ways.

The value of the Pareto Principle for a manager is that it reminds you to focus on the 20%

that matters. Of the things you do during your day, only 20% really matter. Those 20%

produce 80% of your results. Identify and focus on those things.

Page 69

STUDY UNIT 3

Tabulation and Grouping of Data

Contents

Unit  Title                                                      Page

A.    Introduction to Classification and Tabulation of Data        71
      Example                                                      71
B.    Forms of Tabulation                                          75
      Simple Tabulation                                            75
      Complex Tabulation                                           76
C.    Secondary Statistical Tabulation                             79
D.    Rules for Tabulation                                         81
      The Rules                                                    81
      An Example of Tabulation                                     82
E.    Sources of Data & Presentation Methods                       85
      Source, nature, application and use                          85
      Role of statistics in business analysis and decision making  86
      Numerical data                                               90


Page 71

A. INTRODUCTION TO CLASSIFICATION AND TABULATION OF DATA

Having completed the survey and collected the data, we need to organise it so that we can

extract useful information and then present our results. The information will very often

consist of a mass of figures in no very special order. For example, we may have a card index

of the 3,000 workers in a large factory; the cards are probably kept in alphabetical order of

names, but they will contain a large amount of other data such as wage rates, age, sex, type of

work, technical qualifications and so on. If we are required to present to the factory

management a statement about the age structure of the labour force (both male and female),

then the alphabetical arrangement does not help us, and no one could possibly gain any idea

about the topic from merely looking through the cards as they are. What is needed is to

classify the cards according to the age and sex of the worker and then present the results of

the classification as a tabulation. The data in its original form, before classification, is usually

known as “raw data”.

Example

We cannot, of course, give here an example involving 3,000 cards, but you ought now to

follow this “shortened version” involving only a small number of items.

a) Raw Data

15 cards in alphabetical order:

Ayim, L. Mr 39 years

Balewa, W. Mrs 20 “

Buhari, A. Mr 22 “

Boro, W. Miss 22 “

Chahine, S. Miss 32 “

Diop, T. Mr 30 “

Diya, C. Mrs 37 “

Eze, D. Mr 33 “

Egwu, R. Mr 45 “

Gowon, J. Mrs 42 “

Gaxa, F. Miss 24 “

Gueye, W. Mr 27 “

Jalloh, J. Miss 28 “

Jaja, J. Mr 44 “

Jang, L. Mr 39 “

Page 72

b) Classification

(i) According to Sex

Men:
Ayim, L. Mr 39 years
Buhari, A. Mr 22 years
Diop, T. Mr 30 years
Eze, D. Mr 33 years
Egwu, R. Mr 45 years
Gueye, W. Mr 27 years
Jaja, J. Mr 44 years
Jang, L. Mr 39 years

Women:
Balewa, W. Mrs 20 years
Boro, W. Miss 22 years
Chahine, S. Miss 32 years
Diya, C. Mrs 37 years
Gowon, J. Mrs 42 years
Gaxa, F. Miss 24 years
Jalloh, J. Miss 28 years

(ii) According to Age (in Groups)

20 - 29 years:
Balewa, W. Mrs 20 years
Buhari, A. Mr 22 years
Boro, W. Miss 22 years
Gaxa, F. Miss 24 years
Gueye, W. Mr 27 years
Jalloh, J. Miss 28 years

30 - 39 years:
Ayim, L. Mr 39 years
Chahine, S. Miss 32 years
Diop, T. Mr 30 years
Diya, C. Mrs 37 years
Eze, D. Mr 33 years
Jang, L. Mr 39 years

40 - 49 years:
Egwu, R. Mr 45 years
Gowon, J. Mrs 42 years
Jaja, J. Mr 44 years

Page 73

c) Tabulation

The number of cards in each group, after classification, is counted and the results presented in

a table.

Table 3.2

You should look through this example again to make quite sure that you understand what has

been done.
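The classify-and-count step can be sketched in Python using the 15 cards above (the age groups 20–29, 30–39 and 40–49 are assumed from the grouping shown):

```python
# (name, title, age) for each of the 15 cards
cards = [
    ("Ayim, L.", "Mr", 39), ("Balewa, W.", "Mrs", 20), ("Buhari, A.", "Mr", 22),
    ("Boro, W.", "Miss", 22), ("Chahine, S.", "Miss", 32), ("Diop, T.", "Mr", 30),
    ("Diya, C.", "Mrs", 37), ("Eze, D.", "Mr", 33), ("Egwu, R.", "Mr", 45),
    ("Gowon, J.", "Mrs", 42), ("Gaxa, F.", "Miss", 24), ("Gueye, W.", "Mr", 27),
    ("Jalloh, J.", "Miss", 28), ("Jaja, J.", "Mr", 44), ("Jang, L.", "Mr", 39),
]

counts = {}
for name, title, age in cards:
    sex = "Male" if title == "Mr" else "Female"
    group = f"{age // 10 * 10}-{age // 10 * 10 + 9}"   # e.g. 39 -> "30-39"
    counts[(group, sex)] = counts.get((group, sex), 0) + 1

# Tabulate the counts by age group and sex
for key in sorted(counts):
    print(key, counts[key])
```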

You are now in a position to appreciate the purpose behind classification and tabulation - it is

to condense an unwieldy mass of raw data to manageable proportions and then to present the

results in a readily understandable form. Be sure that you appreciate this point, because

examination questions involving tabulation often begin with a first part which asks, "What

is the object of the tabulation of statistical data?", or words to that effect.


Page 75

B. FORMS OF TABULATION

We classify the process of tabulation into Simple Tabulation and Complex or Matrix

Tabulation.

Simple Tabulation

This covers only one aspect of the set of figures. The idea is best conveyed by an example.

Consider the card index mentioned earlier; each card may carry the name of the workshop in

which the person works. A question as to how the labour force is distributed can be answered

by sorting the cards and preparing a simple table thus:

Table 3.3

Another question might have been, "What is the wage distribution in the works?", and the

answer can be given in another simple table (see Table 3.4).

Page 76

Table 3.4

Note that such simple tables do not tell us very much - although it may be enough for the

question of the moment.

Complex Tabulation

This deals with two or more aspects of a problem at the same time. In the problem just

studied, it is very likely that the two questions would be asked at the same time, and we could

present the answers in a complex table or matrix.

Page 77

Table 3.5

Note *140 - 159.99 is the same as "140 but less than 160" and similarly for the other

columns.

This table is much more informative than are the two simple tables, but it is more

complicated. We could have divided the groups further into, say, male and female workers, or

into age groups. In a later part of this study unit I will give you a list of the rules you should

try to follow in compiling statistical tables, and at the end of that list you will find a table

relating to our 3,000 workers, which you should study as you read the rules.


Page 79

C. SECONDARY STATISTICAL TABULATION

So far, our tables have merely classified the already available figures, the primary statistics,

but we can go further than this and do some simple calculations to produce other figures,

secondary statistics. As an example, take the first simple table illustrated above, and calculate

how many employees there are on average per workshop. This is obtained by dividing the

total (3,000) by the number of shops (5), and the table appears thus:

Table 3.6

This average is a "secondary statistic". For another example, we may take the second simple

table given above and calculate the proportion of workers in each wage group, thus:

Table 3.7

Page 80

These proportions are "secondary statistics". In commercial and business statistics, it is more

usual to use percentages than proportions; in the above tables these would be 3.5%, 17%,

30.7%, 33.8%, 10% and 5%.
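The conversion from proportions to percentages is easily sketched in Python; the wage-group counts below are assumed figures, chosen to match the percentages quoted:

```python
counts = [105, 510, 921, 1014, 300, 150]   # assumed wage-group counts (3,000 workers)
total = sum(counts)

percentages = [round(100 * c / total, 1) for c in counts]
print(percentages)   # [3.5, 17.0, 30.7, 33.8, 10.0, 5.0]
```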

Secondary statistics are not, of course, confined to simple tables, they are used in complex

tables too, as in this example:

The percentage columns and the average line show secondary statistics. All the other figures

are primary statistics.

Note carefully that percentages cannot be added or averaged to get the percentage of a

total or of an average. You must work out such percentages on the totals or averages

themselves.

Another danger in the use of percentages has to be watched, and that is that you must not

forget the size of the original numbers. Take, for example, the case of two doctors dealing

with a certain disease. One doctor has only one patient and he cures him - 100% success! The

other doctor has 100 patients of whom he cures 80 - only 80% success! You can see how very

unfair it would be on the hard-working second doctor to compare the percentages alone.

Table 3.8: Inspection Results for a Factory

Product in Two Successive Years

Page 81

D. RULES FOR TABULATION

The Rules

There are no absolute rules for drawing up statistical tables, but there are a few general

principles which, if borne in mind, will help you to present your data in the best possible way.

Here they are:

a) Try not to include too many features in any one table (say, not more than four or five)

as otherwise it becomes rather clumsy. It is better to use two or more separate tables.

b) Each table should have a clear and concise title to indicate its purpose.

c) It should be very clear what units are being used in the table (tonnes, RWF, people,

RWF000, etc.).

d) Blank spaces and long numbers should be avoided, the latter by a sensible degree of

approximation.

e) Columns should be numbered to facilitate reference.

f) Try to have some order to the table, using, for example, size, time, geographical

location or alphabetical order.

g) Figures to be compared or contrasted should be placed as close together as possible.

h) Percentages should be placed near to the numbers on which they are based.

i) Rule the tables neatly - scribbled tables with freehand lines nearly always result in

mistakes and are difficult to follow. However, it is useful to draw a rough sketch first

so that you can choose the best layout and decide on the widths of the columns.

j) Insert totals where these are meaningful, but avoid "nonsense totals". Ask yourself

what the total will tell you before you decide to include it. An example of such a

"nonsense total" is given in the following table:

Table 3.9 : Election Results

Page 82

The totals (470) at the foot of the two columns make sense because they tell us the total

number of seats being contested, but the totals in the final column (550, 390, 940) are

"nonsense totals" for they tell us nothing of value.

k) If numbers need to be totalled, try to place them in a column rather than along a row

for easier computation.

l) If you need to emphasise particular numbers, then underlining, significant spacing or

heavy type can be used. If data is lacking in a particular instance, then insert an

asterisk (*) in the empty space and give the reasons for the lack of data in a footnote.

m) Footnotes can also be used to indicate, for example, the source of secondary data, a

change in the way the data has been recorded, or any special circumstances which

make the data seem odd.

An Example of Tabulation

It is not always possible to obey all of these rules on any one occasion, and there may be

times when you have a good reason for disregarding some of them. But only do so if the

reason is really good - not just to save you the bother of thinking! Study now the layout of the

following table (based on our previous example of 3,000 workpeople) and check through the

list of rules to see how they have been applied.

Table 3.10: ABC & Co. Wage Structure of Labour Force Numbers

of Persons in Specified Categories

Page 83

Note (a) Total no. employed in workshop as a percentage of the total workforce.

Note (b) Total no. in wage group as a percentage of the total workforce.

Table 3.10 can be called a "twofold" table as the workforce is broken down by wage and

workshop.

Page 84

BLANK

Page 85

E. SOURCES OF DATA AND PRESENTATION

METHODS

Sources, nature, application and use:

Sources

Data is generally found through research or as the result of a survey. Data which comes
from a survey is called primary data; it is data which is collected for a particular reason or
research project, such as establishing how much money tourists spend on cultural events
when they come to Rwanda, or how long a particular process takes on average to complete
in a factory. In this case the data will be taken in raw form, i.e. lots of figures, and then
analysed by grouping the data into more manageable groups. The other source of data is
secondary data. This is data which is already available (government statistics, company
reports, etc.). As a business person you can take these figures and use them for whatever
purpose you require.

Nature of data.

Data is classified according to its type. The classifications are as follows:

Categorical data: for example, "Do you currently own any stocks or bonds? Yes / No".

This type of data is generally plotted using a bar chart or pie chart.

Numerical data: This is usually divided into discrete or continuous data.

How many cars do you own? This is discrete data: data that arises from a counting process.

How tall are you? This is continuous data: data that arises from a measuring process, so the
figures cannot be measured precisely. For example, the clock-in times of the workers on a
particular shift: 8:23, 8:14, 8:16, and so on.

Whether data is discrete or continuous will determine the most appropriate method of

presentation.

Page 86

Descriptive statistics: collecting, presenting, analysing.

Precaution in use.

As a business person it is important that you are cautious when reading data and statistics. In

order to draw intelligent and logical conclusions from data you need to understand the

various meanings of statistical terms.

Role of statistics in business analysis and decision making.

In the business world, statistics has four important applications:

– To summarise business data

– To draw conclusions from that data

– To make reliable forecasts about business activities

– To improve business processes.

The field of statistics is generally divided into two areas.

Figure 3.1

Descriptive statistics allows you to create different tables and charts to summarise data. It
also provides statistical measures such as the mean, median, mode and standard deviation
to describe different characteristics of the data.
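These descriptive measures are easy to try out; below is a minimal sketch using Python's standard statistics module, with an invented set of sample figures:

```python
import statistics

# Hypothetical sample: weekly sales (RWF000) for nine branches
sales = [120, 135, 120, 150, 142, 120, 160, 138, 145]

print(statistics.mean(sales))    # arithmetic mean of the sample
print(statistics.median(sales))  # middle value when the data is sorted
print(statistics.mode(sales))    # most frequently occurring value
print(statistics.stdev(sales))   # sample standard deviation
```

Any real data set would be substituted for the invented sales figures.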

Page 87

Inferential statistics: sampling, estimation from the sample, informed opinion.

Figure 3.2

Improving business processes involves using managerial approaches that focus on quality
improvement, such as Six Sigma. These approaches are data driven and use statistical
methods to develop their models.

– Presentation of data, use of bar charts, histograms, pie charts, graphs, tables,

frequency distributions, cumulative distributions, Ogives.

– Their uses and interpretations.

If you look at any magazine or newspaper article, TV show, election campaign etc you will

see many different charts depicting anything from the most popular holiday destination to the

gain in company profits. The nice thing about studying statistics is that once you understand

the concepts the theory remains the same for all situations and you can easily apply your

knowledge to whatever situation you are in.

Tables and charts for categorical data:

When you have categorical data, you tally responses into categories and then present the

frequency or percentage in each category in tables and charts.

The summary table indicates the frequency, amount or percentage of items in each category,

so that you can differentiate between the categories.

Supposing a questionnaire asked people how they preferred to do their banking:

Drawing conclusions about your data is the fundamental point of inferential statistics.
Using these methods allows the researcher to draw conclusions based on data rather than
on intuition.

Page 88

Table 3.11

Banking preference    Frequency    Percentage
In bank                     200            20
ATM                         250            25
Telephone                    97            10
Internet                    450            45
Total                       997           100

The above information could be illustrated using a bar chart

Figure 3.3

(Bar chart of the banking preferences, with frequency on the vertical axis from 0 to 500
and the categories In bank, ATM, Telephone and Internet on the horizontal axis.)

Page 89

Or a pie chart

Figure 3.4

A simple line chart is usually used for time series data, where data is given over time, for
example the average price of a mobile home over the past three years:

Table 3.12

Year      Price (RWF)
2008      350 000
2009      252 000
2010      190 000


Page 90

Raw data is grouped, then illustrated using a histogram or ogive.

Figure 3.5

The above graphs are used for categorical data.

Numerical Data

Numerical data is generally used more in statistics. The way in which numerical data is
processed is as follows.

Figure 3.6

(Line chart of average mobile home prices, with price in RWF on the vertical axis from 0
to 400,000 and the years 2008, 2009 and 2010 on the horizontal axis.)

Page 91

The Histogram:

The histogram is like a bar chart but for numerical data. The important thing to remember

about the histogram is that the area under the histogram represents or is proportionate to the

frequencies. If you are drawing a histogram for data where the class widths are all the same

then it is very easy. If however one class width is bigger or narrower than the others an

adjustment must be made to ensure that the area of the bar is proportionate to the frequency.

Page 92

BLANK

Page 93

STUDY UNIT 4

Graphical Representation of Information

Contents

Unit    Title                                                      Page

A.      Introduction to Frequency Distributions                      95
        Example                                                      95
B.      Preparation of Frequency Distributions                       97
        Simple Frequency Distribution                                97
        Grouped Frequency Distribution                               97
        Choice of Class Interval                                     99
C.      Cumulative Frequency Distributions                          103
D.      Relative Frequency Distributions                            105
E.      Graphical Representation of Frequency Distributions         107
        Frequency Dot Diagram                                       107
        Frequency Bar Chart                                         108
        Frequency Polygon                                           109
        Histogram                                                   110
        The Ogive                                                   114
F.      Introduction to Other Types of Data Presentation            117
G.      Pictograms                                                  119
        Introduction                                                119
        Limited Form                                                120
        Accurate Form                                               121
H.      Pie Charts                                                  123
I.      Bar Charts                                                  125
        Component Bar Chart                                         125
        Horizontal Bar Chart                                        127
J.      General Rules for Graphical Presentation                    129

Page 94

Unit    Title                                                      Page

K.      The Lorenz Curve                                            131
        Purpose                                                     131
        Stages in Construction of a Lorenz Curve                    133
        Interpretation of the Curve                                 135
        Other Uses                                                  136

Page 95

A. INTRODUCTION TO FREQUENCY

DISTRIBUTIONS

A frequency distribution is a tabulation which shows the number of times (i.e. the frequency)

each different value occurs. Refer back to Study Unit 2 and make sure you understand the

difference between "attributes" (or qualitative variables) and "variables" (or quantitative

variables); the term "frequency distribution" is usually confined to the case of variables.

Example

The following figures are the times (in minutes) taken by a shop-floor worker to perform a

given repetitive task on 20 specified occasions during the working day:

3.5 3.8 3.8 3.4 3.6

3.6 3.8 3.9 3.7 3.5

3.4 3.7 3.6 3.8 3.6

3.7 3.7 3.7 3.5 3.9

If we now assemble and tabulate these figures, we obtain a frequency distribution (see Table

4.1).

Table 4.1

Page 96

BLANK

Page 97

B. PREPARATION OF FREQUENCY

DISTRIBUTIONS

Simple Frequency Distribution

A useful way of preparing a frequency distribution from raw data is to go through the records

as they stand and mark off the items by the "tally mark" or "five-bar gate" method. First look

at the figures to see the highest and lowest values so as to decide the range to be covered and

then prepare a blank table.

Now mark the items on your table by means of a tally mark. To illustrate the procedure, the

following table shows the state of the work after all 20 items have been entered.

Table 4.2
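The tally count itself can be sketched in Python; collections.Counter does the counting that the five-bar gates do by hand, applied here to the 20 recorded times:

```python
from collections import Counter

# The 20 task times (minutes) from the example above
times = [3.5, 3.8, 3.8, 3.4, 3.6,
         3.6, 3.8, 3.9, 3.7, 3.5,
         3.4, 3.7, 3.6, 3.8, 3.6,
         3.7, 3.7, 3.7, 3.5, 3.9]

freq = Counter(times)
for value in sorted(freq):
    # print the value, a crude tally, and the frequency
    print(value, "|" * freq[value], freq[value])
```

The printed frequencies (2, 3, 4, 5, 4, 2 for the values 3.4 to 3.9) are exactly the ones tabulated in Table 4.1.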

Grouped Frequency Distribution

Sometimes the data is so extensive that a simple frequency distribution is too cumbersome

and, perhaps, uninformative. Then we make use of a "grouped frequency distribution".

In this case, the "length of time" column consists not of separate values but of groups of

values (see Table 4.3).

Page 98

Table 4.3

Grouped frequency distributions are only needed when there is a large number of values and,

in practice, would not have been required for the small amount of data in our example. Table

4.4 shows a grouped frequency distribution used in a more realistic situation, when an

ungrouped table would not have been of much use.

The various groups (e.g. "25 but less than 30") are called "classes" and the range of values

covered by a class (e.g. five years in this example) is called the "class interval".

The number of items in each class (e.g. 28 in the 25 to 30 class) is called the "class

frequency" and the total number of items (in this example, 220) is called the "total

frequency". As stated before, frequency distributions are usually only considered in

Table 4.4: Age Distribution of Workers in an Office

Page 99

connection with variables and not with attributes, and you will sometimes come across the

term "variate" used to mean the variable in a frequency distribution. The variate in our last

example is "age of worker", and in the previous example the variate was "length of time".

The term "class boundary" is used to denote the dividing line between adjacent classes, so in

the age group example the class boundaries are 15, 20, 25, .... years. In the length of time

example, as grouped earlier in this section, the class boundaries are 3.35, 3.55, 3.75, 3.95

minutes. This needs some explanation. As the original readings were given correct to one

decimal place, we assume that is the precision to which they were measured. If we had had a

more precise stopwatch, the times could have been measured more precisely. In the first

group of 3.4 to 3.5 are put times which could in fact be anywhere between 3.35 and 3.55 if

we had been able to measure them more precisely. A time such as 3.57 minutes would not

have been in this group as it equals 3.6 minutes when corrected to one decimal place and it

goes in the 3.6 to 3.7 group.

Another term, "class limits", is used to stand for the lowest and highest values that can

actually occur in a class. In the age group example, these would be 15 years and 19 years 364

days for the first class, 20 years and 24 years 364 days for the second class and so on,

assuming that the ages were measured correct to the nearest day below. In the length of time

example, the class limits are 3.4 and 3.5 minutes for the first class and 3.6 and 3.7 minutes for

the second class.

You should make yourself quite familiar with these terms, and with others which we will

encounter later, because they are all used freely by examiners and you will not be able to

answer questions if you don’t know what the questioner means!

Choice of Class Interval

When compiling a frequency distribution you should, if possible, make the length of the class

interval equal for all classes so that fair comparison can be made between one class and

another. Sometimes, however, this rule has to be broken (official publications often lump

together the last few classes into one so as to save paper and printing costs) and then, before

we use the information, it is as well to make the classes comparable by calculating a column

showing "frequency per interval of so much", as in this example for some wage statistics:

Page 100

Notice that the intervals in the first column are:

200, 200, 400, 400, 400, 800.

These intervals let you see how the last column was compiled.

A superficial look at the original table (first two columns only) might have suggested that the

most frequent incomes were at the middle of the scale, because of the appearance of the

figure 55,000. But this apparent preponderance of the middle class is due solely to the change

in the length of the class interval, and column three shows that, in fact, the most frequent

incomes are at the bottom end of the scale, i.e. the top of the table.

You should remember that the purpose of compiling a grouped frequency distribution is to

make sense of an otherwise troublesome mass of figures. It follows, therefore, that we do not

want to have too many groups or we will be little better off; nor do we want too few groups

or we will fail to see the significant features of the distribution. As a practical guide, you will

find that somewhere between about five and 20 groups will usually be suitable.

When compiling grouped frequency distributions, we occasionally run into trouble because

some of our values lie exactly on the dividing line between two classes and we wonder which

class to put them into. For example, in the age distribution given earlier in Table 4.4, if we

Table 4.5

Page 101

have someone aged exactly 40 years, do we put him into the "35-40" group or into the "40-

45" group? There are two possible solutions to this problem:

a) Describe the classes as "x but less than y" as we have done in Table 4.4, and then there

can be no doubt.

b) Where an observation falls exactly on a class boundary, allocate half an item to each

of the adjacent classes. This may result in some frequencies having half units, but this

is not a serious drawback in practice.

The first of these two procedures is the one to be preferred.
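A minimal Python sketch of the preferred "x but less than y" grouping, using invented ages; note how a value lying exactly on a class boundary (such as 40) goes unambiguously into the upper class:

```python
from collections import Counter

# Group ages into "x but less than y" classes of width 5, starting at 15.
# The ages below are invented for illustration.
ages = [17, 22, 40, 35, 44, 19, 40, 53, 28, 31]

def class_of(age, start=15, width=5):
    """Return the (lower, upper) class holding age, with lower <= age < upper."""
    lower = start + ((age - start) // width) * width
    return lower, lower + width

dist = Counter(class_of(a) for a in ages)
for (lo, hi), f in sorted(dist.items()):
    print(f"{lo} but less than {hi}: {f}")
```

An age of exactly 40 falls in the "40 but less than 45" class, so the boundary question never arises.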

Page 102

BLANK

Page 103

C. CUMULATIVE FREQUENCY DISTRIBUTIONS

Very often we are not especially interested in the separate class frequencies, but in the

number of items above or below a certain value. When this is the case, we form a cumulative

frequency distribution as illustrated in column three of the following table:

The cumulative frequency tells us the number of items equal to or less than the specified

value, and it is formed by the successive addition of the separate frequencies. A cumulative

frequency column may also be formed for a grouped distribution.

The above example gives us the number of items "less than" a certain amount, but we may

wish to know, for example, the number of persons having more than some quantity. This can

easily be done by doing the cumulative additions from the bottom of the table instead of the

top, and as an exercise you should now compile the "more than" cumulative frequency

column in the above example.

Table 4.6
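The successive additions can be sketched with Python's itertools.accumulate; the class frequencies below are invented for illustration. The "more than" column is the same accumulation run from the bottom of the table:

```python
from itertools import accumulate

# Hypothetical class frequencies, top of the table first
freqs = [4, 9, 12, 8, 5, 2]

less_than = list(accumulate(freqs))              # add downwards from the top
more_than = list(accumulate(freqs[::-1]))[::-1]  # add upwards from the bottom

print(less_than)  # "equal to or less than" cumulative frequencies
print(more_than)  # "more than or equal to" cumulative frequencies
```

The final "less than" entry and the first "more than" entry both equal the total frequency, which is a useful check.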

Page 104

BLANK

Page 105

D. RELATIVE FREQUENCY DISTRIBUTIONS

All the frequency distributions which we have looked at so far in this study unit have had

their class frequencies expressed simply as numbers of items. However, remember that

proportions or percentages are useful secondary statistics. When the frequency in each class

of a frequency distribution is given as a proportion or percentage of the total frequency, the

result is known as a "relative frequency distribution" and the separate proportions or

percentages are the "relative frequencies". The total relative frequency is, of course, always

1.0 (or 100%). Cumulative relative frequency distributions may be compiled in the same way

as ordinary cumulative frequency distributions.

As an example, the distribution used in Table 4.5 is now set out as a relative frequency

distribution for you to study.

This example is in the "less than" form, and you should now compile the "more than" form in

the same way as you did for the non-relative distribution.

Table 4.7
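Relative frequencies are simply each class frequency divided by the total frequency; a minimal sketch with invented frequencies:

```python
# Hypothetical class frequencies
freqs = [4, 9, 12, 8, 5, 2]
total = sum(freqs)  # total frequency, 40 here

relative = [f / total for f in freqs]
print(relative)       # proportions of the total in each class
print(sum(relative))  # should total 1 (up to rounding)
```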

Page 106

BLANK

Page 107

E. GRAPHICAL REPRESENTATION OF

FREQUENCY DISTRIBUTIONS

Tabulated frequency distributions are sometimes more readily understood if represented by a

diagram. Graphs and charts are normally much superior to tables (especially lengthy complex

tables) for showing general states and trends, but they cannot usually be used for accurate

analysis of data. The methods of presenting frequency distributions graphically are as

follows:

– Frequency dot diagram

– Frequency bar chart

– Frequency polygon

– Histogram

– Ogive.

We will now examine each of these in turn.

Frequency Dot Diagram

This is a simple form of graphical representation for the frequency distribution of a discrete

variate. A horizontal scale is used for the variate and a vertical scale for the frequency. Above

each value on the variate scale we mark a dot for each occasion on which that value occurs.

Thus, a frequency dot diagram of the distribution of times taken to complete a given task,

which we have used in this study unit, would look like Figure 4.1.

Page 108

Frequency Bar Chart

We can avoid the business of marking every dot in such a diagram by drawing instead a

vertical line the length of which represents the number of dots which should be there. The

frequency dot diagram in Figure 4.1 now becomes a frequency bar chart, as in Figure 4.2.

Figure 4.1: Frequency Dot Diagram to Show Length of Time Taken

by Operator to Complete a Given Task

Figure 4.2: Frequency Bar Chart

Page 109

Frequency Polygon

Instead of drawing vertical bars as we do for a frequency bar chart, we could merely mark the

position of the top end of each bar and then join up these points with straight lines. When we

do this, the result is a frequency polygon, as in Figure 4.3.

Note that we have added two fictitious classes at each end of the distribution, i.e. we have

marked in groups with zero frequency at 3.3 and 4.0.

This is done to ensure that the area enclosed by the polygon and the horizontal axis is the

same as the area under the corresponding histogram which we shall consider in the next

section.

These three kinds of diagram are all commonly used as a means of making frequency

distributions more readily comprehensible. They are mostly used in those cases where the

variate is discrete and where the values are not grouped. Sometimes frequency bar charts

and polygons are used with grouped data by drawing the vertical line (or marking its top end)

at the centre point of the group.

Figure 4.3: Frequency Polygon

Page 110

Histogram

This is the best way of graphing a grouped frequency distribution. It is of great practical

importance and is also a favourite topic among examiners. Refer back now to the grouped

distribution given earlier in Table 4.4 (ages of office workers) and then study Figure 4.5.

Figure 4.5: Histogram

Page 111

We call this kind of diagram a "histogram". The frequency in each group is represented by a

rectangle and - this is a very important point - it is the AREA of the rectangle, not its

height, which represents the frequency.

When the lengths of the class intervals are all equal, then the heights of the rectangles

represent the frequencies in the same way as do the areas (this is why the vertical scale has

been marked in this diagram); if, however, the lengths of the class intervals are not all equal,

you must remember that the heights of the rectangles have to be adjusted to give the correct

areas. Do not stop at this point if you have not quite grasped the idea, because it will become

clearer as you read on.

Look once again at the histogram of ages given in Figure 4.5 and note particularly how it

illustrates the fact that the frequency falls off towards the higher age groups - any form of

graph which did not reveal this fact would be misleading. Now let us imagine that the

original table had NOT used equal class intervals but, for some reason or other, had given the

last few groups as:

The last two groups have been lumped together as one. A WRONG form of histogram, using

heights instead of areas, would look like Figure 4.6.

Table 4.8

Page 112

Now, this clearly gives an entirely wrong impression of the distribution with respect to the

higher age groups. In the correct form of the histogram, the height of the last group (50-60)

would be halved because the class interval is double all the other class intervals. The

histogram in Figure 4.7 gives the right impression of the falling off of frequency in the higher

age groups. I have labelled the vertical axis "Frequency density per 5-year interval" as five

years is the "standard" interval on which we have based the heights of our rectangles.
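The height adjustment can be sketched as a small calculation: each bar's height is its frequency per standard interval, so that area (height times width) remains proportional to frequency. The class frequencies below are invented, except that the final class is double width, as in the text:

```python
# (lower boundary, upper boundary, frequency); figures are illustrative only
classes = [(35, 40, 20), (40, 45, 14), (45, 50, 10), (50, 60, 12)]
standard = 5  # the "standard" class width (5 years)

for lo, hi, freq in classes:
    width = hi - lo
    height = freq * standard / width  # frequency density per 5-year interval
    print(f"{lo}-{hi}: frequency {freq}, bar height {height}")
```

The 50-60 class spans two standard intervals, so its bar height is half its frequency, just as the text halves the height of the combined last group.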

Figure 4.6

Page 113

Often it happens, in published statistics, that the last group in a frequency table is not

completely specified. The last few groups may look as in Table 4.9:

Figure 4.7

Table 4.9

Page 114

How do we draw the last group on the histogram?

If the last group has a very small frequency compared with the total frequency (say, less than

about 1% or 2%) then nothing much is lost by leaving it off the histogram altogether. If the

last group has a larger frequency than about 1% or 2%, then you should try to judge from the

general shape of the histogram how many class intervals to spread the last frequency over in

order not to create a false impression of the extent of the distribution. In the example given,

you would probably spread the last 30 people over two or three class intervals but it is often

simpler to assume that an open-ended class has the same length as its neighbour. Whatever

procedure you adopt, the important thing in an examination paper is to state clearly what you

have done and why. A distribution of the kind we have just discussed is called an "open-

ended" distribution.

The Ogive

This is the name given to the graph of the cumulative frequency. It can be drawn in either the

"less than" or the "or more" form, but the "less than" form is the usual one. Ogives for two of

the distributions already considered in this study unit are now given as examples; Figure 4.8

is for ungrouped data and Figure 4.9 is for grouped data.

Study these two diagrams so that you are quite sure that you know how to draw them. There

is only one point which you might be tempted to overlook in the case of the grouped

distribution - the points are plotted at the ends of the class intervals and NOT at the centre

point. Look at the example and see how the 168,000 is plotted against the upper end of the

56-60 group and not against the mid-point, 58. If we had been plotting an "or more" ogive,

the plotting would have to have been against the lower end of the group.

Page 115

As an example of an "or more" ogive, we will compile the cumulative frequency of our

example from Section B, which for convenience is repeated below with the "more than"

cumulative frequency:

Figure 4.8

Figure 4.9

Table 4.10

Page 116

Check that you see how the plotting has been made against the lower end of the group and

notice how the ogive has a reversed shape.

In each of Figures 4.9 and 4.10 we have added a fictitious group of zero frequency at one end

of the distribution.

It is common practice to call the cumulative frequency graph a cumulative frequency polygon

if the points are joined by straight lines, and a cumulative frequency curve if the points are

joined by a smooth curve.

(N.B. Unless you are told otherwise, always compile a "less than" cumulative frequency.)

All of these diagrams, of course, may be drawn from the original figures or on the basis of

relative frequencies. In more advanced statistical work the latter are used almost exclusively

and you should practise using relative frequencies whenever possible.
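The plotting positions for a "less than" ogive can be sketched as follows; the classes and frequencies are invented. Note how each cumulative frequency is paired with the upper class boundary, with a fictitious zero group at the start:

```python
from itertools import accumulate

# Hypothetical grouped distribution: boundaries mark the class edges
boundaries = [15, 20, 25, 30, 35]   # class boundaries (years)
freqs = [28, 40, 22, 10]            # frequency of each class

cum = list(accumulate(freqs))
points = list(zip(boundaries[1:], cum))  # (upper boundary, cumulative frequency)
points.insert(0, (boundaries[0], 0))     # fictitious zero-frequency start point
print(points)
```

Joining these points with straight lines gives a cumulative frequency polygon; a smooth curve through them gives a cumulative frequency curve.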

Figure 4.10

The ogive now appears as shown in Figure 4.10

Page 117

F. INTRODUCTION TO OTHER TYPES OF DATA

PRESENTATION

The graphs we have seen so far in this study unit are all based on frequency distributions.

Next we shall discuss several common graphical presentations that are designed more for the

lay reader than someone with statistical knowledge. You will certainly have seen some

examples of them used in the mass media of newspapers and television.

Page 118

BLANK

Page 119

G. PICTOGRAMS

Introduction

This is the simplest method of presenting information visually. These diagrams are variously

called "pictograms", "ideograms", "picturegrams" or "isotypes" - the words all refer to the

same thing. Their use is confined to the simplified presentation of statistical data for the

general public. Pictograms consist of simple pictures which represent quantities. There are

two types and these are illustrated in the following examples. The data we will use is shown

in Table 4.11.

Table 4.11: Cruises Organised by a Shipping Line Between Year

1 and Year 3

Page 120

Limited Form

a) We could represent the number of cruises by ships of varying size, as in Figure 4.11.

b) Although these diagrams show that the number of cruises has increased each year,

they can give false impressions of the actual increases. The reader can become

confused as to whether the quantity is represented by the length or height of the

pictograms, their area on the paper, or the volume of the object they represent. It is

difficult to judge what increase has taken place. Sometimes you will find pictograms

in which the sizes shown are actually WRONG in relation to the real increases. To

avoid confusion, I recommend that you use the style of diagram shown in Figure 4.12.

Figure 4.11: Number of Cruises Years 1-3

(Source: Table 4.11)

Page 121

Accurate Form

Each matchstick man is the same height and represents 20,000 passengers, so there can be no

confusion over size.
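The accurate form of pictogram amounts to dividing each quantity by the fixed amount that one symbol represents; a minimal sketch with invented passenger figures:

```python
# One symbol (matchstick man) per fixed quantity of passengers.
# The yearly figures below are invented for illustration.
SYMBOL = 20_000  # passengers represented by each symbol

for year, passengers in [("Year 1", 60_000), ("Year 2", 100_000), ("Year 3", 140_000)]:
    men = passengers // SYMBOL  # whole symbols to draw
    print(year, "x" * men, f"({passengers:,} passengers)")
```

As the text notes for Figure 4.13, quantities smaller than one symbol's worth are hard to represent, which is the main limitation of the method.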

These diagrams have no purpose other than generally presenting statistics in a simple way.

Look at Figure 4.13.

Figure 4.12: Passengers Carried Years 1-3

(Source: Table 4.11)

Figure 4.13: Imports of Crude Oil

Page 122

Here it is difficult to represent a quantity less than 10m barrels, e.g. does "[" represent 0.2m

or 0.3m barrels?

Page 123

H. PIE CHARTS

These diagrams, known also as circular diagrams, are used to show the manner in which

various components add up to a total. Like pictograms, they are only used to display very

simple information to non-expert readers. They are popular in computer graphics.

An example will show what the pie chart is. Suppose that we wish to illustrate the sales of

gas in Rwanda in a certain year. The figures are shown in Table 4.12.

The figures are illustrated in the pie chart or circular diagram in Figure 4.14.

Table 4.12: Gas Sales in Rwanda in One Year

Figure 4.14: Example of a Pie Chart (Gas Sales in Rwanda)

(Source: Table 4.12)

Page 124

c) Construct the diagram by means of a pair of compasses and a protractor. Don’t

overlook this point, because examiners dislike inaccurate and roughly drawn

diagrams.

d) Label the diagram clearly, using a separate "legend" or "key" if necessary. (A key is

illustrated in Figure 4.14.)

e) If you have the choice, don’t use a diagram of this kind with more than four or five

component parts.

Note: The actual number of therms can be inserted on each sector as it is not possible to read

this exactly from the diagram itself.
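Each sector's angle is the component's share of the total, scaled to 360 degrees. A minimal sketch follows; the component names and values are invented, since the figures of Table 4.12 are not reproduced here:

```python
# Hypothetical gas sales by sector (therms); values are illustrative only
components = {"Domestic": 1800, "Industrial": 900, "Commercial": 600, "Other": 300}
total = sum(components.values())

for name, value in components.items():
    angle = 360 * value / total  # degrees of the circle for this sector
    print(f"{name}: {angle:.0f} degrees")
```

The angles necessarily sum to 360 degrees, so they can be marked off with a protractor as the text recommends.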

The main use of a pie chart is to show the relationship each component part bears to the

whole. They are sometimes used side by side to provide comparisons, but this is not really to

be recommended, unless the whole diagram in each case represents exactly the same total

amount, as other diagrams (such as bar charts, which we discuss next) are much clearer.

However, in examinations you may be asked specifically to prepare such pie charts.

Page 125

I. BAR CHARTS

We have already met one kind of bar chart in the course of our studies of frequency

distributions, namely the frequency bar chart. A "bar" is simply another name for a thick line.

In a frequency bar chart the bars represent, by their length, the frequencies of different values

of the variate. The idea of a bar chart can, however, be extended beyond the field of

frequency distributions, and we will now illustrate a number of the types of bar chart in

common use. I say "illustrate" because there are no rigid and fixed types, but only general

ideas which are best studied by means of examples. You can supplement the examples in this

study unit by looking at the commercial pages of newspapers and magazines.

Component Bar Chart

This first type of bar chart serves the same purpose as a circular diagram and, for that reason,

is sometimes called a "component bar diagram" (see Figure 4.15).

Figure 4.15: Component Bar Chart Showing Cost

of Production of ZYX Co. Ltd

Page 126

Note that the lengths of the components represent the amounts, and that the components are

drawn in the same order so as to facilitate comparison. These bar charts are preferable to

circular diagrams because:

a) They are easily read, even when there are many components.

b) They are more easily drawn.

c) It is easier to compare several bars side by side than several circles.

Bar charts with vertical bars are sometimes called "column charts" to distinguish them from

those in which the bars are horizontal (see Figure 4.16).

Figure 4.16 is also an example of a percentage component bar chart, i.e. the information is

expressed in percentages rather than in actual numbers of visitors.

If you compare several percentage component bar charts, you must be careful. Each bar chart

will be the same length, as they each represent 100%, but they will not necessarily represent

the same actual quantities, e.g. 50% might have been 1 million, whereas in another year it

may have been nearer to 4 million and in another to 8 million.

Figure 4.16: Horizontal Bar Chart of Visitors

Arriving in Rwanda in One Year

Page 127

Horizontal Bar Chart

A typical case of presentation by a horizontal bar chart is shown in Figure 4.17. Note how a

loss is shown by drawing the bar on the other side of the zero line.

Pie charts and bar charts are especially useful for "categorical" variables as well as for

numerical variables. The example in Figure 4.17 shows a categorical variable, i.e. the

different branches form the different categories, whereas in Figure 4.15 we have a numerical

variable, namely, time. Figure 4.17 is also an example of a multiple or compound bar chart as

there is more than one bar for each category.

Figure 4.17: Horizontal Bar Chart for the So and So Company Ltd to

Show Profits Made by Branches in Year 1 and Year 2

Page 128


Page 129

K. GENERAL RULES FOR GRAPHICAL

PRESENTATION

There are a number of general rules which must be borne in mind when planning and using

graphical methods:

a) Graphs and charts must be given clear but brief titles.

b) The axes of graphs must be clearly labelled, and the scales of values clearly marked.

c) Diagrams should be accompanied by the original data, or at least by a reference to the

source of the data.

d) Avoid excessive detail, as this defeats the object of diagrams.

e) Wherever necessary, guidelines should be inserted to facilitate reading.

f) Try to include the origins of scales. Obeying this rule sometimes wastes paper space; in such a case the graph may be "broken" as shown in Figure 4.18, but take care not to distort the graph by over-emphasising small variations.

Figure 4.18

Page 130


Page 131

L. THE LORENZ CURVE

Purpose

One of the problems which frequently confronts the statistician working in economics or

industry is that of CONCENTRATION. Suppose that, in a business employing 100 men, the

total weekly wages bill is RWF10,000 and that every one of the 100 men gets RWF100; there

is then an equal distribution of wages and there is no concentration. Suppose now that, in

another business employing 100 men and having a total weekly wages bill of RWF10,000,

there are 12 highly skilled experts getting RWF320 each and 88 unskilled workers getting

RWF70 each. The wages are not now equally distributed and there is some concentration of

wages in the hands of the skilled experts. These experts number 12 out of 100 people (i.e.

they constitute 12% of the labour force); their share of the total wages bill is 12 x RWF320

(i.e. RWF3,840) out of RWF10,000, which is 38.4%. We can therefore say that 38.4% of the

firm’s wages is concentrated in the hands of only 12% of its employees.
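The percentages in this example can be verified with a short calculation. This is a minimal sketch in Python; the variable names are purely illustrative:

```python
# Concentration example: 12 skilled experts at RWF320, 88 unskilled workers at RWF70.
experts, expert_wage = 12, 320
unskilled, unskilled_wage = 88, 70

total_bill = experts * expert_wage + unskilled * unskilled_wage  # RWF10,000
expert_share = 100 * experts * expert_wage / total_bill          # % of the wages bill
expert_headcount = 100 * experts / (experts + unskilled)         # % of the labour force
```

This confirms the figures in the text: 38.4% of the wages bill is concentrated in the hands of 12% of the employees.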

In the example just discussed there were only two groups, the skilled and the unskilled. In a

more realistic case, however, there would be a larger number of groups of people with

different wages, as in the following example:

Wages Group (RWF) Number of People Total Wages (RWF)

0 - 80 205 10,250

80 - 120 200 22,000

120 - 160 35 4,900

160 - 200 30 5,700

200 - 240 20 4,400

240 - 280 10 2,500

Total 500 49,750

Page 132

Obviously when we have such a set of figures, the best way to present them is to graph them,

which I have done in Figure 4.19. Such a graph is called a LORENZ CURVE. (The next

section shows how we obtain this graph.)

Figure 4.19: Lorenz Curve

Page 133

Stages in Construction of a Lorenz Curve

a) Draw up a table giving:

(i) the cumulative frequency;

(ii) the percentage cumulative frequency;

(iii) the cumulative wages total;

(iv) the percentage cumulative wages total.

b) On graph paper draw scales of 0-100% on both the horizontal and vertical axes. The

scales should be the same on both axes.

c) Plot the cumulative percentage frequency against the cumulative percentage wages

total and join up the points with a smooth curve. Remember that 0% of the employees

earn 0% of the total wages so that the curve will always go through the origin.

d) Draw in the 45° diagonal. Note that, if the wages had been equally distributed, i.e.

50% of the people had earned 50% of the total wages, etc., the Lorenz curve would

have been this diagonal line.

The graph is shown in Figure 4.19.

Table 4.13
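Stage (a) of the construction can be sketched in a few lines of Python, using the wage-group figures from the table above; the helper name is illustrative:

```python
# Wage-group frequencies and group wage totals from the table above.
freqs = [205, 200, 35, 30, 20, 10]               # number of people per group
wages = [10250, 22000, 4900, 5700, 4400, 2500]   # total wages per group (RWF)

def cumulative_percent(values):
    """Running totals expressed as percentages of the grand total."""
    total, running, out = sum(values), 0, []
    for v in values:
        running += v
        out.append(100 * running / total)
    return out

cum_people = cumulative_percent(freqs)   # horizontal coordinates of the Lorenz curve
cum_wages = cumulative_percent(wages)    # vertical coordinates of the Lorenz curve
# Plotting cum_people against cum_wages (plus the point 0%, 0%) and adding the
# 45-degree diagonal gives Figure 4.19.
```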

Page 134

Sometimes you will be given the wages bill as a grouped frequency distribution alone,

without the total wages for each group being specified. Consider the following set of figures:

Wages Group (RWF) No. of People

0 - 40 600

40 - 80 250

80 - 120 100

120 - 160 30

160 - 200 20

Total 1,000

As we do not know the actual wage of each person, the total amount of money involved in

each group is estimated by multiplying the number of people in the group by the mid-value of

the group; for example, the total amount of money in the "RWF40-RWF80" group is 250 x

RWF60 = RWF15,000. The construction of the table and the Lorenz curve then follows as

before. Try working out the percentages for yourself first and then check your answers with

the following table. Your graph should look like Figure 4.20.

Table 4.14
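The mid-value estimates can be computed directly. A minimal sketch, using the group bounds and frequencies just given:

```python
# Groups given as (lower bound, upper bound, frequency).
groups = [(0, 40, 600), (40, 80, 250), (80, 120, 100), (120, 160, 30), (160, 200, 20)]

# Estimated total wages per group: frequency x mid-value of the group.
group_totals = [f * (lo + hi) / 2 for lo, hi, f in groups]
# e.g. the RWF40-RWF80 group: 250 x RWF60 = RWF15,000
```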

Page 135

Interpretation of the Curve

From Figure 4.20 we can read directly the share of the wages paid to any given percentage of

employees:

a) 50% of the employees earn 22% of the total wages, so we can deduce that the other

50%, i.e. the more highly paid employees, earn 78% of the total wages.

b) 90% of the employees earn 70% of the total wages, so 10% of the employees must

earn 30% of the total wages.

c) 95% of the employees earn 83% of the total wages, so 5% of the employees earn 17%

of the total wages.

Figure 4.20: Lorenz Curve

Page 136

Other Uses

Although usually used to show the concentration of wealth (incomes, property ownership,

etc.), Lorenz curves can also be employed to show concentration of any other feature. For

example, the largest proportion of a country’s output of a particular commodity may be

produced by only a small proportion of the total number of factories, and this fact can be

illustrated by a Lorenz curve.

Concentration of wealth or productivity, etc. may increase or decrease as time goes on. A series of Lorenz curves on one graph will show up such a change. In some countries,

in recent years, there has been a tendency for incomes to be more equally distributed. A

Lorenz curve reveals this because the curves for successive years lie nearer to the straight

diagonal.

Page 137

STUDY UNIT 5

Averages or Measures of Location

Contents

Unit

Title

Page

A.

The Need for Measures of Location

139

B.

The Arithmetic Mean

141

Introduction

141

The Mean of a Simple Frequency Distribution

142

The Mean of a Grouped Frequency Distribution

144

Simplified Calculation

145

Characteristics of the Arithmetic Mean

151

C.

The Mode

153

Mode of a Simple Frequency Distribution

153

Mode of a Grouped Frequency Distribution

154

Characteristics of the Mode

155

D.

The Median

159

Introduction

159

Median of a Simple Frequency Distribution

160

Median of a Grouped Frequency Distribution

160

Characteristics of the Median

163

Page 138


Page 139

A. THE NEED FOR MEASURES OF LOCATION

We looked at frequency distributions in detail in the previous study unit and you should, by

means of a quick revision, make sure that you have understood them before proceeding.

A frequency distribution may be used to give us concise information about its variate, but

more often, we will wish to compare two or more distributions. Consider, for example, the

distribution of the weights of eggs from two different breeds of poultry (which is a topic in

which you would be interested if you were the statistician in an egg marketing company).

Having weighed a large number of eggs from each breed, we would have compiled frequency

distributions and graphed the results. The two frequency polygons might well look something

like Figure 5.1.

Examining these distributions you will see that they look alike except for one thing - they are

located on different parts of the scale. In this case the distributions overlap and, although

some eggs from Breed A are of less weight than some eggs from Breed B, eggs from Breed A

are, in general, heavier than those from Breed B.

Figure 5.1

Page 140

Remember that one of the objects of statistical analysis is to condense unwieldy data so as to

make it more readily understood. The drawing of frequency curves has enabled us to make an

important general statement concerning the relative egg weights of the two breeds of poultry,

but we would now like to take the matter further and calculate some figure which will serve

to indicate the general level of the variable under discussion. In everyday life we commonly

use such a figure when we talk about the "average" value of something or other. We might

have said, in reference to the two kinds of egg, that those from Breed A had a higher average

weight than those from Breed B. Distributions with different averages indicate that there is a

different general level of the variate in the two groups. The single value which we use to

describe the general level of the variate is called a "measure of location" or a "measure of

central tendency" or, more commonly, an average.

There are three such measures with which you need to be familiar:

− The arithmetic mean

− The mode

− The median.

Page 141

B. THE ARITHMETIC MEAN

Introduction

This is what we normally think of as the "average" of a set of values. It is obtained by adding

together all the values and then dividing the total by the number of values involved. Take, for

example, the following set of values which are the heights, in inches, of seven men:

Man Height (ins)

A 74

B 63

C 64

D 71

E 71

F 66

G 74

Total 483

The arithmetic mean of these heights is 483 ÷ 7 = 69 ins. Notice that some values occur more

than once, but we still add them all.

At this point we must introduce a little algebra. We don’t always want to specify what

particular items we are discussing (heights, egg weights, wages, etc.) and so, for general

discussion, we use, as you will recall from algebra, some general letter, usually x. Also, we

indicate the sum of a number of x’s by Σ (sigma).

Thus, in our example, we may write:

Σx = 483

Page 142

We indicate the arithmetic mean by the symbol x̄ (called "x bar") and the number of items by the letter n. The calculation of the arithmetic mean can be described by formula thus:

x̄ = (x1 + x2 + … + xn) ÷ n or, more compactly, x̄ = Σx ÷ n

The last one is customary in statistical work. Applying it to the example above, we have:

x̄ = 483 ÷ 7 = 69 ins

You will often find the arithmetic mean simply referred to as "the mean" when there is no chance of confusion with other means (which we are not concerned with here).
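As a check, the whole calculation for the seven heights takes three lines (a sketch only; Python's built-in sum and len do the work):

```python
heights = [74, 63, 64, 71, 71, 66, 74]   # heights of men A-G, in inches

n = len(heights)       # n = 7
total = sum(heights)   # Σx = 483
mean = total / n       # x̄ = Σx ÷ n = 69 ins
```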

The Mean of a Simple Frequency Distribution

When there are many items (i.e. when n is large) the arithmetic can be eased somewhat by

forming a frequency distribution, like this:

Table 5.1


Page 143

Table 5.2

Indicating the frequency of each value by the letter f, you can see that Σf = n and that, when the x's are not all the separate values but only the different ones, the formula becomes:

x̄ = Σ(fx) ÷ Σf

Of course, with only seven items it would not be necessary, in practice, to use this method, but if we had a much larger number of items the method would save a lot of additions.

QUESTION FOR PRACTICE

a) Consider now Table 5.2. Complete the (fx) column and calculate the value of the arithmetic mean, x̄.

Page 144

You should have obtained the following answers:

The total number of items, Σf = 100

The total product, Σ(fx) = 713

The arithmetic mean, x̄ = 713 ÷ 100 = 7.13

Make sure that you understand this study unit so far. Revise it if necessary, before going on

to the next paragraph. It is most important that you do not get muddled about calculating

arithmetic means.
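The x̄ = Σ(fx) ÷ Σf calculation can be written as a small function. The distribution below is hypothetical, since Table 5.2 itself is not reproduced in this extract; only the method is the point:

```python
def mean_of_frequency_distribution(dist):
    """dist maps each value x to its frequency f; returns Σ(fx) ÷ Σf."""
    sum_f = sum(dist.values())
    sum_fx = sum(x * f for x, f in dist.items())
    return sum_fx / sum_f

# Hypothetical simple frequency distribution: value -> frequency.
dist = {0: 5, 1: 12, 2: 18, 3: 9, 4: 6}
mean = mean_of_frequency_distribution(dist)   # Σ(fx) = 99, Σf = 50, so x̄ = 1.98
```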

The Mean of a Grouped Frequency Distribution

Suppose now that you have a grouped frequency distribution. In this case, you will remember, we do not know the actual individual values, only the groups in which they lie. How, then, can we calculate the arithmetic mean? The answer is that we cannot calculate the exact value of x̄, but we can make an approximation sufficiently accurate for most statistical purposes. We do this by assuming that all the values in any group are equal to the mid-point of that group.

The procedure is very similar to that for a simple frequency distribution, so it is not repeated in detail here. Provided that Σf is not less than about 50 and that the number of groups is not less than about 12, the arithmetic mean thus calculated is sufficiently accurate for all practical purposes.

Table 5.3

Group

0 - 10

10 - 20

20 - 30

30 - 40

40 - 50

50 - 60

60 - 70
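The mid-point method can be sketched as follows. The frequencies here are assumed for illustration, as the original frequency column is not reproduced in this extract; they total 50 and reproduce the mean of 33.2 quoted later in this section:

```python
# Groups 0-10, 10-20, ..., 60-70 with assumed frequencies (Σf = 50).
bounds = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
freqs = [4, 7, 10, 12, 9, 6, 2]   # assumed for illustration only

midpoints = [(lo + hi) / 2 for lo, hi in bounds]   # 5, 15, 25, ..., 65
sum_f = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
mean = sum_fx / sum_f   # approximate x̄, treating every item as its group mid-point
```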

Page 145

There is one pitfall to be avoided when using this method: if the groups do not all have the same class interval, be sure that you get the correct mid-values! The following is part of a table with varying class intervals, to illustrate the point:

You will remember that in discussing the drawing of histograms we had to deal with the case

where the last group was not exactly specified. The same rules for drawing the histogram

apply to the calculation of the arithmetic mean.

Simplified Calculation

It is possible to simplify the arithmetic still further by the following two devices:

a) Work from an assumed mean in the middle of one convenient class.

b) Work in class intervals instead of in the original units.

Let us consider device (a). If you go back to our earlier examples you will discover after

some arithmetic that if you add up the differences in value between each reading and the true

mean, then these differences add up to zero.

Group Mid-Value (x)

0 - 10 5

10 - 20 15

20 - 40 30

40 - 60 50

60 - 100 80

Table 5.4

Page 146

Take first the height distribution discussed at the start of Section B:

x̄ = 69 ins

Secondly, consider the grouped frequency distribution given earlier in this section:

x̄ = 33.2

Table 5.5

Table 5.6

Group

0 - 10

10 - 20

20 - 30

30 - 40

40 - 50

50 - 60

60 - 70

Page 147

If we take any value other than x̄ and follow the same procedure, the sum of the differences (sometimes called deviations) will not be zero. In our first example, let us assume the mean to be 68 ins and label the assumed mean xo. The differences between each reading and this assumed value are:

+6, -5, -4, +3, +3, -2, +6, giving a total deviation of +7.

We make use of this property as a "short-cut" for finding x̄. Firstly, we have to choose some value of x as an assumed mean. We try to choose it near to where we think the true mean will lie, and we always choose it as the mid-point of one of the groups when we are dealing with a grouped frequency distribution. In the above example, the total deviation, Σd, does not equal zero, so 68 cannot be the true mean. As the total deviation is positive, we must have UNDERESTIMATED in our choice of xo, so the true mean is higher than 68. As there are seven readings, we need to adjust xo upwards by one seventh of the total deviation, i.e. by (+7)/7 = +1. Therefore the true value of x̄ is:

x̄ = 68 + 1 = 69 ins

We know this to be the correct answer from our earlier work.
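The short-cut can be verified in a few lines using the height data (a sketch; xo is the assumed mean):

```python
heights = [74, 63, 64, 71, 71, 66, 74]
x0 = 68                                    # assumed mean
deviations = [h - x0 for h in heights]     # +6, -5, -4, +3, +3, -2, +6
total_dev = sum(deviations)                # +7
true_mean = x0 + total_dev / len(heights)  # 68 + 7/7 = 69 ins
```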

Let us now illustrate the "short-cut" method for the grouped frequency distribution. We shall

take xo as 35 as this is the mid-value in the centre of the distribution.

Table 5.7

Page 148

This time we must have OVERESTIMATED xo, as the total deviation, Σfd, is negative. As there are 50 readings altogether, the true mean must be 1/50th of the (-90) lower than 35, i.e.:

x̄ = 35 - 90/50 = 35 - 1.8 = 33.2

which is as we found previously.

Device (b) can be used with a grouped frequency distribution to work in units of the class interval instead of in the original units. In the fourth column of Table 5.7, you can see that all the deviations are multiples of 10, so we could have worked in units of 10 throughout and then compensated for this at the end of the calculation.

Let us repeat the calculation using this method. The result (with xo = 35) is:

Table 5.8

Group

0 - 10

10 - 20

20 - 30

30 - 40

40 - 50

50 - 60

Page 149

The symbol used for the length of the class interval is c, but you may also come across the

symbol i used for this purpose.
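Device (b) can be sketched in the same way: deviations are expressed in units of the class interval, c = 10, and scaled back at the end. The frequencies are again assumed for illustration, chosen so that the result agrees with the x̄ = 33.2 found earlier:

```python
midpoints = [5, 15, 25, 35, 45, 55, 65]   # mid-values of the groups 0-10 ... 60-70
freqs = [4, 7, 10, 12, 9, 6, 2]           # assumed frequencies, Σf = 50
x0, c = 35, 10                            # assumed mean and class interval

d = [(x - x0) // c for x in midpoints]    # deviations in class-interval units: -3 ... +3
sum_fd = sum(f * di for f, di in zip(freqs, d))
mean = x0 + c * sum_fd / sum(freqs)       # scale back by c at the end
```

Working with the small integers -3 to +3 instead of the original deviations is exactly the saving in arithmetic the text describes.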

Table 5.9

Group

0 - 10

10 - 20

20 - 30

30 - 40

40 - 50

50 - 60

60 - 70

Page 150

As we mentioned at an earlier stage, you have to be very careful if the class intervals are

unequal, because you can only use one such interval as your working unit. Table 5.10 shows

you how to deal with this situation.

The assumed mean is 35, as before, and the working unit is a class interval of 10. Notice how d for the last group is worked out: the mid-point is 60, which is 2½ times 10 above the assumed mean. The required arithmetic mean is, therefore:

We have reached a slightly different figure from before because of the error introduced by the

coarser grouping in the "50-70" region.

The method just described is of great importance both in everyday statistical work and in examinations. By using it correctly, you can often do the calculations for very complicated-looking distributions using mental arithmetic and pencil and paper.

With the advent of electronic calculators, the time saving on calculations of the arithmetic

mean is not great, but this method is still preferable because:

• The numbers involved are smaller and thus you are less likely to make slips in

arithmetic.

• The method can be extended to enable us to find easily the standard deviation of a

frequency distribution.

Table 5.10

Group

0 - 10

10 - 20

20 - 30

30 - 40

40 - 50

50 - 60

60 - 70

Page 151

Characteristics of the Arithmetic Mean

There are a number of characteristics of the arithmetic mean which you must know and

understand. Apart from helping you to understand the topic more thoroughly, the following

are the points which an examiner expects to see when he or she asks for "brief notes" on the

arithmetic mean:

a) It is not necessary to know the value of every item in order to calculate the arithmetic

mean. Only the total and the number of items are needed. For example, if you know

the total wages bill and the number of employees, you can calculate the arithmetic

mean wage without knowing the wages of each person.

b) It is fully representative because it is based on all, and not only some, of the items in

the distribution.

c) One or two extreme values can make the arithmetic mean somewhat unreal by their

influence on it. For example, if a millionaire came to live in a country village, the

inclusion of his income in the arithmetic mean for the village would make the place

seem very much better off than it really was!

d) The arithmetic mean is reasonably easy to calculate and to understand.

e) In more advanced statistical work it has the advantage of being amenable to algebraic

manipulation.

Page 152

QUESTION FOR PRACTICE

1) Table 5.11 shows the consumption of electricity of 100 householders during a

particular week. Calculate the arithmetic mean consumption of the 100 householders.

Table 5.11

Page 153

C. THE MODE

Mode of a Simple Frequency Distribution

The first alternative to the mean which we will discuss is the mode. This is the name given to

the most frequently occurring value. Look at the following frequency distribution:

In this case the most frequently occurring value is 1 (it occurred 39 times) and so the mode of

this distribution is 1. Note that the mode, like the mean, is a value of the variate, x, not the

frequency of that value. A common error is to say that the mode of the above distribution is

39. THIS IS WRONG. The mode is 1. Watch out, and do not fall into this trap!

For comparison, calculate the arithmetic mean of the distribution: it works out at 1.52. The

mode is used in those cases where it is essential for the measure of location to be an actually

occurring value. An example is the case of a survey carried out by a clothing store to

determine what size of garment to stock in the greatest quantity. Now, the average size of

garment in demand might turn out to be, let us say, 9.3724, which is not an actually occurring

value and doesn’t help us to answer our problem. However, the mode of the distribution

obtained from the survey would be an actual value (perhaps size 8) and it would provide the

answer to the problem.

Table 5.12

Page 154

Mode of a Grouped Frequency Distribution

When the data is given in the form of a grouped frequency distribution, it is not quite so

easy to determine the mode. What, you might ask, is the mode of the following distribution?

All we can really say is that "70 - 80" is the modal group (the group with the largest

frequency). You may be tempted to say that the mode is 75, but this is not true, nor even a

useful approximation in most cases. The reason is that the modal group depends on the

method of grouping, which can be chosen quite arbitrarily to suit our convenience. The

distribution could have been set out with class intervals of five instead of 10, and would then

have appeared as follows (only the middle part is shown, to illustrate the point):

Table 5.13

Table 5.14

Group

0 - 10

10 - 20

20 - 30

30 - 40

40 - 50

50 - 60

60 - 70

70 - 80

80 - 90

90 - 100

100 - 110

110 - 120

Page 155

The modal group is now "65-70". Likewise, we will get different modal groups if the

grouping is by 15 or by 20 or by any other class interval, and so the mid-point of the modal

group is not a good way of estimating the mode.

In practical work, this determination of the modal group is usually sufficient, but examination

papers occasionally ask for the mode to be determined from a grouped distribution.

A number of procedures based on the frequencies in the groups adjacent to the modal group

can be used, and I will now describe one procedure. You should note, however, that these

procedures are only mathematical devices for finding the MOST LIKELY position of the

mode; it is not possible to calculate an exact and true value in a grouped distribution.

We saw that the modal group of our original distribution was "70-80". Now examine the

groups on each side of the modal group; the group below (i.e. 60-70) has a frequency of 38,

and the one above (i.e. 80-90) has a frequency of 20. This suggests to us that the mode may

be some way towards the lower end of the modal group rather than at the centre. A graphical

method for estimating the mode is shown in Figure 5.2.

This method can be used when the distribution has equal class intervals. Draw that part of the

histogram which covers the modal class and the adjacent classes on either side.

Draw in the diagonals AB and CD as shown in Figure 5.2. From the point of intersection

draw a vertical line downwards. Where this line crosses the horizontal axis is the mode. In

our example the mode is just less than 71.
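The diagonal construction is equivalent to the standard interpolation formula Mode = L + c(fm - f1)/((fm - f1) + (fm - f2)), where L is the lower bound of the modal class, c its width, fm its frequency, and f1 and f2 the frequencies of the classes below and above. The modal frequency itself is not quoted in this extract, so the value 40 below is an assumption; with the adjacent frequencies 38 and 20 from the text it gives a mode just under 71, matching the graphical result:

```python
def grouped_mode(lower, width, f_modal, f_below, f_above):
    """Estimate the mode within the modal class by interpolation
    (equivalent to intersecting the diagonals in Figure 5.2)."""
    d1 = f_modal - f_below   # excess over the class below
    d2 = f_modal - f_above   # excess over the class above
    return lower + width * d1 / (d1 + d2)

# Modal class 70-80; 38 and 20 come from the text, 40 is an assumed modal frequency.
mode = grouped_mode(lower=70, width=10, f_modal=40, f_below=38, f_above=20)
```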

Page 156

Figure 5.2

Page 157

Characteristics of the Mode

Some of the characteristics of the mode are worth noting as you may well be asked to

compare them with those of the arithmetic mean.

a) The mode is very easy to find with ungrouped distributions, since no calculation is

required.

b) It can only be determined roughly with grouped distributions.

c) It is not affected by the occurrence of extreme values.

d) Unlike the arithmetic mean, it is not based on all the items in the distribution, but only

on those near its value.

e) In ungrouped distributions the mode is an actually occurring value.

f) It is not amenable to the algebraic manipulation needed in advanced statistical work.

g) It is not unique, i.e. there can be more than one mode. For example, in the set of

numbers, 6, 7, 7, 7, 8, 8, 9, 10, 10, 10, 12, 13, there are two modes, namely 7 and 10.

This set of numbers would be referred to as having a bimodal distribution.

h) The mode may not exist. For example, in the set of numbers 7, 8, 10, 11, 12, each

number occurs only once so this distribution has no mode.
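Points (g) and (h) can be checked with Python's statistics module, whose multimode function returns every value that attains the greatest frequency:

```python
from statistics import multimode

# Bimodal set from point (g): both 7 and 10 occur three times.
modes = multimode([6, 7, 7, 7, 8, 8, 9, 10, 10, 10, 12, 13])   # [7, 10]

# In point (h) every value occurs exactly once, so no value stands out:
# multimode returns all of them, which is why we say there is no mode.
no_mode = multimode([7, 8, 10, 11, 12])   # [7, 8, 10, 11, 12]
```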

Page 158


Page 159

D. THE MEDIAN

Introduction

The desirable feature of any measure of location is that it should be near the middle of the

distribution to which it refers. Now, if a value is near the middle of the distribution, then we

expect about half of the distribution to have larger values, and the other half to have smaller

values. This suggests to us that a possible measure of location might be that value which is

such that exactly half (i.e. 50%) of the distribution has larger values and exactly half has

lower values. The value which so divides the distribution into equal parts is called the

MEDIAN. Look at the following set of values:

6, 7, 7, 8, 8, 9, 10, 10, 10, 12, 13

The total of these eleven numbers is 100 and the arithmetic mean is therefore 100/11 = 9.091,

while the mode is 10 because that is the number which occurs most often (three times). The

median, however, is 9 because there are five values above and five values below 9. Our first

rule for determining the median is therefore as follows:

Arrange all the values in order of magnitude and the median is then the middle value.

Note that all the values are to be used: even though some of them may be repeated, they must

all be put separately into the list. In the example just dealt with, it was easy to pick out the

middle value because there was an odd number of values. But what if there is an even

number? Then, by convention, the median is taken to be the arithmetic mean of the two

values in the middle. For example, take the following set of values:

6, 7, 7, 8, 8, 9, 10, 10, 11, 12

The two values in the middle are 8 and 9, so that the median is 8.5.
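This convention can be written as a short function (it is the same rule that the statistics.median function in Python's standard library applies):

```python
def median(values):
    """Middle value of the sorted list; mean of the two middle values if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

# The two sets used above:
odd_case = median([6, 7, 7, 8, 8, 9, 10, 10, 10, 12, 13])   # 9 (middle of 11 values)
even_case = median([6, 7, 7, 8, 8, 9, 10, 10, 11, 12])      # 8.5 (mean of 8 and 9)
```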

Page 160

Median of a Simple Frequency Distribution

Statistical data, of course, is rarely in such small groups and, as you have already learned, we

usually deal with frequency distributions. How, then, do we find the median if our data is in

the form of a distribution?

Let us take the example of the frequency distribution of accidents already used in discussing

the mode. The total number of values is 123 and so when those values are arranged in order

of magnitude, the median will be the 62nd item because that will be the middle item. To see

what the value of the 62nd item will be, let us again draw up the distribution:

You can see from the last column that, if we were to list all the separate values in order, the

first 27 would all be 0s and from then up to the 66th would be 1s; it follows therefore that the

62nd item would be a 1 and that the median of this distribution is 1.
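The walk down the cumulative frequency column can be sketched as a function. Only the frequencies 27 (for 0 accidents) and 39 (for 1 accident) are given in this extract; the remaining frequencies below are assumed purely so that the total comes to 123:

```python
# value -> frequency; 27 and 39 come from the text, the rest are assumed.
dist = {0: 27, 1: 39, 2: 35, 3: 14, 4: 5, 5: 2, 6: 1}

def median_of_frequency_distribution(dist):
    """Walk the cumulative frequencies until the middle item is reached."""
    n = sum(dist.values())   # 123 items in all
    middle = (n + 1) // 2    # the 62nd item
    running = 0
    for value in sorted(dist):
        running += dist[value]
        if running >= middle:
            return value

result = median_of_frequency_distribution(dist)   # items 28 to 66 are all 1s, so 1
```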

Median of a Grouped Frequency Distribution

The final problem connected with the median is how to find it when our data is in the form of

a grouped distribution. The solution to the problem, as you might expect, is very similar to

the solution for an ungrouped distribution; we halve the total frequency and then find, from

the cumulative frequency column, the corresponding value of the variate.

Table 5.15

Page 161

Because a grouped frequency distribution nearly always has a large total frequency, and

because we do not know the exact values of the items in each group, it is not necessary to

find the two middle items when the total frequency is even: just halve the total frequency and

use the answer (whether it is a whole number or not) for the subsequent calculation.

The total frequency is 206 and therefore the median is the 103rd item which, from the

cumulative frequency column, must lie in the 60-70 group. But exactly where in the 60-70

group? Well, there are 92 items before we get to that group and we need the 103rd item, so

we obviously need to move into that group by 11 items. Altogether in our 60-70 group there

are 38 items so we need to move 11/38 of the way into that group, that is 11/38 of 10 above

60. Our median is therefore

60 + 110/38 = 60 + 2.89 = 62.89.
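The interpolation step can be written out directly, using the figures from the text:

```python
total_frequency = 206
lower, width = 60, 10   # the 60-70 group
cum_before = 92         # items accumulated before the 60-70 group
group_freq = 38         # items within the 60-70 group

position = total_frequency / 2   # the 103rd item
median = lower + (position - cum_before) / group_freq * width
# 60 + 11/38 of 10 = 62.89 (to 2 d.p.)
```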

The use of the cumulative frequency distribution will, no doubt, remind you of its graphical

representation, the ogive. In practice, a convenient way to find the median of a grouped

distribution is to draw the ogive and then, against a cumulative frequency of half the total

frequency, to read off the median. In our example the median would be read against 103 on

the cumulative frequency scale (see Figure 5.3). If the ogive is drawn with relative

frequencies, then the median is always read off against 50%.

Table 5.16

Group

0 - 10

10 - 20

20 - 30

30 - 40

40 - 50

50 - 60

60 - 70

70 - 80

80 - 90

90 - 100

100 - 110

110 - 120

Page 162

Figure 5.3

Page 163

Characteristics of the Median

Characteristic features of the median, which you should compare with those of the mean and

the mode, are as follows:

a) It is fairly easily obtained in most cases, and is readily understood as being the "half-

way point".

b) It is less affected by extreme values than the mean. The millionaire in the country

village might alter considerably the mean income of the village but he would have

almost no effect at all on the median.

c) It can be obtained without actually having all the values. If, for example, we want to

know the median height of a group of 21 men, we do not have to measure the height

of every single one; it is only necessary to stand the men in order of their heights and

then only the middle one (No. 11) need be measured, for his height will be the median

height. The median is thus of value when we have open-ended classes at the edges of

the distribution as its calculation does not depend on the precise values of the variate

in these classes, whereas the value of the arithmetic mean does.

d) The median is not very amenable to further algebraic manipulation.

Page 164


Page 165

STUDY UNIT 6

Measures of Dispersion

Contents

Unit

Title

Page

A.

Introduction to Dispersion

167

B.

The Range

169

C.

The Quartile Deviation, Deciles and Percentiles

171

The Quartile Deviation

171

Calculation of the Quartile Deviation

173

Deciles and Percentiles

175

D.

The Standard Deviation

177

The Variance

177

Standard Deviation of a Simple Frequency Distribution

178

Standard Deviation of a Grouped Frequency Distribution

178

Characteristics of the Standard Deviation

181

E.

The Coefficient of Variation

183

F.

Skewness

185

G.

Averages & Measures of Dispersion

189

Measures of Central Tendency and Dispersion

189

The mean and Standard Deviation

192

The Standard Deviation

194

The Median and the Quartiles

196

The Mode

199

Dispersion and Skewness

202

Page 166


Page 167

A. INTRODUCTION TO DISPERSION

In order to get an idea of the general level of values in a frequency distribution, we have

studied the various measures of location that are available. However, the figures

which go to make up a distribution may all be very close to the central value, or they may

be widely dispersed about it, e.g. the mean of 49 and 51 is 50, but the mean of 0 and 100 is

also 50! You can see, therefore, that two distributions may have the same mean but the

individual values may be spread about the mean in vastly different ways.

When applying statistical methods to practical problems, a knowledge of this spread

(which we call "dispersion" or "variation") is of great importance. Examine the figures in

the following table:

Although the two factories have the same mean output, they are very different in their

week-to-week consistency. Factory A achieves its mean production with only very little

variation from week to week, whereas Factory B achieves the same mean by erratic ups-

and-downs from week to week. This example shows that a mean (or other measure of

location) does not, by itself, tell the whole story and we therefore need to supplement it with

a "measure of dispersion".

Table 6.1

Page 168

As was the case with measures of location, there are several different measures of dispersion

in use by statisticians. Each has its own particular merits and demerits, which will be

discussed later. The measures in common use are:

− Range

− Quartile deviation

− Mean deviation

− Standard deviation

We will discuss three of these here.

Page 169

B. THE RANGE

This is the simplest measure of dispersion: it is simply the difference between the largest and the smallest values in the distribution. In the example just given, the lowest weekly output for Factory A was 90 and the highest was 107, so the range is 17. For Factory B the range is 156 - 36 = 120. The larger range for Factory B shows that it performs less consistently than Factory A.
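In code the range is a one-line calculation. The extreme values below (90 and 107 for Factory A, 36 and 156 for Factory B) come from the text; the in-between weekly outputs are assumed, chosen so that both factories also share the same mean:

```python
# Weekly outputs; only the extremes are quoted in the text, the rest are assumed.
factory_a = [94, 90, 107, 98, 102, 100, 95, 101, 99, 97]
factory_b = [120, 36, 156, 80, 110, 60, 140, 100, 50, 131]

range_a = max(factory_a) - min(factory_a)   # 107 - 90 = 17
range_b = max(factory_b) - min(factory_b)   # 156 - 36 = 120
```

The identical totals but very different ranges illustrate why a measure of location alone does not tell the whole story.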

The advantage of the range as a measure of the dispersion of a distribution is that it is very

easy to calculate and its meaning is easy to understand. For these reasons it is used a great

deal in industrial quality control work. Its disadvantage is that it is based on only two of the

individual values and takes no account of all those in between. As a result, one or

two extreme results can make it quite unrepresentative. Consequently, the range is not

much used except in the case just mentioned.

Page 170

BLANK

Page 171

C. THE QUARTILE DEVIATION, DECILES AND

PERCENTILES

The Quartile Deviation

This measure of dispersion is sometimes called the "semi-interquartile range". To understand

it, you must cast your mind back to the method of obtaining the median from the ogive. The

median, you remember, is the value which divides the total frequency into two halves. The

values which divide the total frequency into quarters are called quartiles and they can also

be found from the ogive, as shown in Figure 6.1.

Figure 6.1

Page 172


This is the same ogive that we drew earlier when finding the median of the grouped

frequency distribution featured in Section D of the previous study unit.

You will notice that we have added the relative cumulative frequency scale to the right of

the graph. 100% corresponds to 206, i.e. the total frequency. It is then easy to read off the

values of the variate corresponding to 25%, 50% and 75% of the cumulative frequency,

giving the lower quartile (Q1), the median and the upper quartile (Q3) respectively.

Q1 = 46.5

Median = 63 (as found previously)

Q3 = 76

The difference between the two quartiles is the interquartile range, and half of the
difference is the semi-interquartile range or quartile deviation:

Quartile deviation = (Q3 − Q1)/2 = (76 − 46.5)/2 = 14.75

Alternatively, you can work out 25% of the total frequency, i.e. 206/4 = 51.5, and 75% of
the total frequency, i.e. 154.5, and read from the ogive the values of the variate
corresponding to 51.5 and 154.5 on the cumulative frequency scale (i.e. the left-hand
scale). The end result is the same.

Page 173

Calculation of the Quartile Deviation

The quartile deviation is not difficult to calculate and some examination questions may

specifically ask for it to be calculated, in which case a graphical method is not acceptable.

Graphical methods are never quite as accurate as calculations.

We shall again use the same example.

The table of values is reproduced for convenience:

We can make the calculations in exactly the same manner as we used for calculating the

median - we saw this in Section D of the previous study unit.

Table 6.2

Group
0 < 10
10 < 20
20 < 30
30 < 40
40 < 50
50 < 60
60 < 70
70 < 80
80 < 90
90 < 100
100 < 110
110 < 120

Page 174

Looking at Table 6.2, the 51½th item comes in the 40-50 group and will be the (51½ – 36) =

15½th item within it.

Similarly, the upper quartile will be the 154½th item, which is in the 70-80 group and is the
(154½ – 130) = 24½th item within it.

Remember that the units of the quartiles and of the median are the same as those of the

variate.

The quartile deviation is unaffected by an occasional extreme value. It is not based,

however, on the actual value of all the items in the distribution and to this extent it is less

representative than the standard deviation. In general, when a median is the appropriate

measure of location then the quartile deviation should be used as the measure of

dispersion.

Page 175

Deciles and Percentiles

It is sometimes convenient, particularly when dealing with wages and employment

statistics, to consider values similar to the quartiles but which divide the distribution more

finely. Such partition values are deciles and percentiles. From their names you will

probably have guessed that the deciles are the values which divide the total frequency into

tenths and the percentiles are the values which divide the total frequency into hundredths.

Obviously it is only meaningful to consider such values when we have a large total

frequency.

The deciles are labelled D1, D2 ... D9: the second decile D2, for example, is the value below

which 20% of the data lies and the sixth decile D6 is the value below which 60% of the data

lies.

The percentiles are labelled P1, P2 ... P99 and, for example, P5 is the value below which 5%

of the data lies and P64 is the value below which 64% of the data lies.

Using the same example as above, let us calculate, as an illustration, the third decile D3.
The method follows exactly the same principles as the calculation of the median and
quartiles.

D3 is the value below which 30% of the data lies, and 30% of 206 is 61.8, so we are looking
for the value of the 61.8th item. A glance at the cumulative frequency column shows that the
61.8th item lies in the 50-60 group, and is the (61.8 – 60) = 1.8th item within it.

So, D3 = 50 + (1.8/30) × 10 = 50.6

Therefore 30% of our data lies below 50.6.

Page 176

We could also have found this result graphically; again check that you agree with the

calculation by reading D3 from the graph. You will see that the calculation method enables

us to give a more precise answer than is obtainable graphically.

Page 177

D. THE STANDARD DEVIATION

Most important of the measures of dispersion is the standard deviation. Except for the use

of the range in statistical quality control and the use of the quartile deviation in wages

statistics, the standard deviation is used almost exclusively in statistical practice. It is

defined as the square root of the variance and so we need to know how to calculate the

variance first.

The Variance

We start by finding the deviations from the mean, and then squaring them, which

removes the negative signs in a mathematically acceptable fashion, thus:

Table 6.3

Page 178

Standard Deviation of a Simple Frequency Distribution

If the data had been given as a frequency distribution (as is often the case) then only the

different values would appear in the "x" column and we would have to remember to

multiply each result by its frequency:

Standard Deviation of a Grouped Frequency Distribution

When we come to the problem of finding the standard deviation of a grouped frequency

distribution, we again assume that all the readings in a given group fall at the mid-point of

the group, so we can find the arithmetic mean as before. Let us use the following

distribution, with the mean deviation,

x = 41.7.

Table 6.4

Page 179

SD = √228.89 = 15.13

The arithmetic is rather tedious even with an electronic calculator, but we can extend the

"short-cut" method which we used for finding the arithmetic mean of a distribution, to

find the standard deviation as well. In that method we:

− Worked from an assumed mean.

− Worked in class intervals.

− Applied a correction to the assumed mean.

Table 6.5

Class

10 ‹ 20

20 ‹ 30

30 ‹ 40

40 ‹ 50

50 ‹ 60

60 ‹ 70

70 ‹ 80

Page 180

Table 6.6 shows you how to work out the standard deviation.

The standard deviation is calculated in four steps from this table, as follows:

Table 6.6

Page 181

This may seem a little complicated, but if you work through the example a few times, it will

all fall into place. Remember the following points:

a) Work from an assumed mean at the mid-point of any convenient class.

b) The correction is always subtracted from the approximate variance.

c) As you are working in class intervals, it is necessary to multiply by the class interval

as the last step.

d) The correction factor is the same as that used for the "short-cut" calculation of the
mean, but for the SD it has to be squared.

e) The column for d² may be omitted since fd² = fd multiplied by d. But do not omit
it until you have really grasped the principles involved.

f) The assumed mean should be chosen from a group with the most common interval
and c will be that interval. If the intervals vary too much, we revert to the basic
formula.

Characteristics of the Standard Deviation

In spite of the apparently complicated method of calculation, the standard deviation is the

measure of dispersion used in all but the very simplest of statistical studies. It is based on

all of the individual items, it gives slightly more emphasis to the larger deviations but does

not ignore the smaller ones and, most important, it can be treated mathematically in more

advanced statistics.

Page 182

BLANK

Page 183

E. THE COEFFICIENT OF VARIATION

Suppose that we are comparing the profits earned by two businesses. One of them may

be a fairly large business with average monthly profits of RWF50,000, while the other

may be a small firm with average monthly profits of only RWF2,000. Clearly, the general

level of profits is very different in the two cases, but what about the month-by-month

variability? We will compare the two firms as to their variability by calculating the two

standard deviations; let us suppose that they both come to RWF500. Now, RWF500 is a

much more significant amount in relation to the small firm than it is in relation to the large

firm so that, although they have the same standard deviations, it would be unrealistic to say

that the two businesses are equally consistent in their month-to-month earnings of

profits. To overcome the difficulty, we express the SD as a percentage of the mean in each

case and we call the result the "coefficient of variation".

Applying the idea to the figures which we have just quoted, we get coefficients of variation
(usually indicated in formulae by V or CV) as follows:

Large firm: CV = 500/50,000 × 100% = 1%
Small firm: CV = 500/2,000 × 100% = 25%

This shows that, relatively speaking, the small firm is more erratic in its earnings than the
large firm.

Note that although a standard deviation has the same units as the variate, the coefficient of

variation is a ratio and thus has no units.

Another application of the coefficient of variation comes when we try to compare

distributions the data of which are in different units as, for example, when we try to

compare a French business with an American business. To avoid the trouble of converting

the dollars to euro (or vice versa) we can calculate the coefficients of variation in each case

and thus obtain comparable measures of dispersion.

Page 184

BLANK

Page 185

F. SKEWNESS

When the items in a distribution are dispersed equally on each side of the mean, we say

that the distribution is symmetrical. Figure 6.2 shows two symmetrical distributions.

When the items are not symmetrically dispersed on each side of the mean, we say that the

distribution is skew or asymmetric.

A distribution which has a tail drawn out to the right is said to be positively skew, while one

with a tail to the left, is negatively skew. Two distributions may have the same mean and the

same standard deviation but they may be differently skewed. This will be obvious if you

look at one of the skew distributions in Figure 6.3 and then look at the same one through

from the other side of the paper!

Figure 6.2

Figure 6.3

Page 186

What, then, does skewness tell us? It tells us that we are to expect a few unusually high

values in a positively skew distribution or a few unusually low values in a negatively skew

distribution.

If a distribution is symmetrical, the mean, mode and median all occur at the same point, i.e.

right in the middle. But in a skew distribution the mean and the median lie somewhere along

the side of the "tail", although the mode is still at the point where the curve is highest.

The more skewed the distribution, the greater the distance from the mode to the mean

and the median, but these two are always in the same order; working outwards from the

mode, the median comes first and then the mean - see Figure 6.4.

For most distributions, except for those with very long tails, the following

relationship holds approximately:

Mean – Mode = 3(Mean – Median)

Figure 6.4

Page 187

The more skew the distribution, the more spread out are these three measures of location,

and so we can use the amount of this spread to measure the amount of skewness. The most

usual way of doing this is to calculate one of Pearson's coefficients of skewness:

Skewness = (Mean − Mode)/Standard deviation
or: Skewness = 3(Mean − Median)/Standard deviation

You are expected to use one of these formulae when an examiner asks for the

skewness (or "coefficient of skewness", as some of them call it) of a distribution. When

you do the calculation, remember to get the correct sign (+ or –) when subtracting the mode

or median from the mean and then you will get negative answers from negatively skew

distributions, and positive answers for positively skew distributions. The value of the

coefficient of skewness is between –3 and +3, although values below –1 and above +1 are

rare and indicate very skewed distributions.

Examples of variates with positive skew distributions include size of incomes of a large

group of workers, size of households, length of service in an organisation, and age of a

workforce. Negative skew distributions occur less frequently. One such example is the age at

death for the adult population in Rwanda.
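The second of the two coefficients above can be sketched in Python. The income figures below are invented purely for illustration (a small positively skewed set); the function names are not from the text:

```python
from statistics import mean, median, pstdev

# Invented, positively skewed data (e.g. incomes): one long tail to the right.
incomes = [12, 14, 15, 15, 16, 18, 21, 26, 40]

# Pearson's coefficient of skewness: 3(Mean - Median) / Standard deviation.
sk = 3 * (mean(incomes) - median(incomes)) / pstdev(incomes)
print(round(sk, 2))
```

The result is positive, as expected for a right-tailed distribution, and lies between 0 and +3 as the text describes.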

Page 188

BLANK

Page 189


G. AVERAGES AND MEASURES OF DISPERSION

Measures of Central Tendency and Dispersion

• Averages and variations for ungrouped and grouped data.

• Special cases such as the Harmonic mean and the geometric mean

In the last section we described data using graphs, histograms and ogives, mainly for grouped
numerical data. Sometimes we do not want a graph; we want one figure to describe the data.
One such figure is called the average. There are three different averages: all summarise the
data with just one figure, but each one has a different interpretation.

The most obvious, and most common, way of describing data is to give an average
figure. If I said the average amount of alcohol consumed by Rwandan women is 2.6 units per
week, how useful is this information? Usually averages on their own are not much use;
you also need a measure of how spread out the data is. We will deal with the spread of the
data later.

Take the following 11 results, each representing a student's mark:

x = 10, 55, 65, 30, 89, 5, 87, 60, 55, 37, 35.

Figure 6.5

Page 190

What is the average mark? The mean is 528/11 = 48, while the median (the middle value of
the ranked marks) is 55 and the mode is 55.

From this example concerning students' results, the mean figure is less than the median
figure, so if you wished to give the impression to your boss that the results were good you
would use the median as the average rather than the mean. In business, therefore, when
quoted an average number you need to be aware which one is being used.

The range = largest number – smallest number = 89 – 5 = 84 gives an idea of how spread out

the data is. This is a useful figure if trying to analyse what the mean is saying. In this case it

would show that the spread of results was very wide and that perhaps it might be better to

divide the class or put on extra classes in future. Remember that the statistics only give you

the information; it is up to you to interpret them. Usually in order to interpret them correctly

you need to delve into the data more and maybe do some further qualitative research.
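The three averages and the range for the 11 marks above can be checked in a few lines of Python using the standard library (the variable names are illustrative):

```python
from statistics import mean, median, mode

marks = [10, 55, 65, 30, 89, 5, 87, 60, 55, 37, 35]

avg = mean(marks)                 # arithmetic mean: 528 / 11 = 48
mid = median(marks)               # middle value of the ranked marks: 55
most = mode(marks)                # most frequently occurring mark: 55
spread = max(marks) - min(marks)  # range: 89 - 5 = 84

print(avg, mid, most, spread)
```

Note how the mean (48) sits below the median (55), exactly the situation discussed above.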

Question

A random sample of 5 weeks showed that a cruise agency received

the following number of weekly specials to the Caribbean:

20 73 75 80 82

(a) Compute the mean, median and mode

(b) Which measure of central tendency best describes the data?

Page 191

[Diagram: the three different averages: arithmetic mean, median and mode]

What is the best average, if any, to use in each of the following situations? Justify each

of your answers.

(a) To establish a typical wage to be used by an employer in wage negotiations for a small

company of 300 employees, a few of whom are very highly paid specialists.

(b) To determine the height to construct a bridge (not a drawbridge) where the distribution of
the heights of all ships which would pass under it is known and is skewed to the right.

There are THREE different measures of AVERAGE, and three different measures of
dispersion. Once you know the mean and the standard deviation you can tell much more
about the data than if you have the average only.

Figure 6.6

Page 192

[Diagram: measures of dispersion: the standard deviation (paired with the mean), the
quartile deviation (paired with the median); none is associated with the mode]

The Mean and Standard Deviation.

This is very important.

The mean of grouped data is more complex than for raw data because you do not have the
raw figures in front of you; they have already been grouped. To find the mean you therefore
need to find the midpoint of each group and then apply the following formula:

Mean = ∑fx / ∑f

where x represents the midpoint of each class and f represents the frequency
of that class. Note that if you are given an open-ended class then you must decide for
yourself what the midpoint is. The midpoint of the class 5 < 10 is 7.5. The midpoint of a
class < 10 you could say is 5 or 8 or whatever you decide is reasonable below 10; it depends
on what you take as the lower bound of the class.

If you need to get the midpoint of a class 36 < 56, the easiest way is to add 36 + 56 and
divide by 2, giving 46.

Like all maths you just need to understand one example and then all the others follow the

same pattern. You do need to understand what you are doing though because in your exam
you may get a question which has a slight trick and you need to be confident enough to
figure out the approach necessary to continue.

Figure 6.7

Page 193

Using the example we had in the last section on statistics grades, we will now work out the

average grade.

Results                  f    Midpoint (x)      fx
0 but less than 20       8        10            80
20 but less than 30      9        25           225
30 but less than 40     11        35           385
40 but less than 50     14        45           630
50 but less than 60     11        55           605
60 but less than 70     10        65           650
70 but less than 80      9        75           675
80 but less than 90      6        85           510
90 but less than 100     2        95           190
Total                   80                    3950

The mean score from the grouped data is given by:

µ = ∑fx / ∑f = 3950/80 = 49.38
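The grouped-mean calculation just shown can be sketched directly in Python (the variable names are illustrative):

```python
# Grouped mean for the statistics scores: x is the class midpoint,
# f the class frequency; mean = sum(f * x) / sum(f).
midpoints   = [10, 25, 35, 45, 55, 65, 75, 85, 95]
frequencies = [ 8,  9, 11, 14, 11, 10,  9,  6,  2]

total_fx = sum(f * x for f, x in zip(frequencies, midpoints))  # 3950
total_f = sum(frequencies)                                     # 80
grouped_mean = total_fx / total_f
print(grouped_mean)
```

This prints 49.375, i.e. 49.38 to two decimal places, matching the hand calculation.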

The Do-It-Better Manufacturing Company operates a shift loading system whereby 60

employees work a range of hours depending on company demands. The following data was

collected:

Hours worked No. of employees

16 < 20 1

20 < 24 2

24 < 28 3

28 < 32 11

32 < 36 14

36 < 40 12

40 < 44 9

44 < 48 5

48 < 52 3

Table 6.6

Table 6.7

Page 194

The Standard Deviation

The next thing to estimate is the standard deviation. This is one figure which gives an
indication of how spread out the data is. In the above example the number of hours worked
is between 16 and 52, which is not that spread out, so the standard deviation should be about
7 (a rule of thumb is that 3 standard deviations should bring you from the mean to the
highest or lowest figure in the data set). The mean here is 36, so if we take 36 − 16 = 20 and
divide by 3 we get approximately 7, or we could take (52 − 36)/3 = 5.3; we take the bigger
figure. However this is just a rough estimate and not sufficient for your exam. For your
exam you need to apply the formula, so you need to be able to work through it.

Raw data:      S.D. = √( ∑(X − X̄)² / n )

Grouped data:  S.D. = √( ∑f(X − X̄)² / ∑f )

We will work through an example for finding the standard deviation for raw data first:

Find the standard deviation of the following 5 numbers:

X= 10, 20, 30, 40, 50.

The mean is 30.

Using the table below, the standard deviation equals:

S.D. = √(1000/5) = √200 = 14.14

X       X̄      X − X̄     (X − X̄)²
10      30      −20         400
20      30      −10         100
30      30        0           0
40      30       10         100
50      30       20         400
Total                      1000
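The raw-data calculation above translates directly into Python (a minimal sketch; variable names are illustrative):

```python
import math

# Population standard deviation of the five raw values, exactly as in the
# table: deviations from the mean, squared, summed, divided by n.
values = [10, 20, 30, 40, 50]
n = len(values)
mean = sum(values) / n                             # 30.0
sum_sq_dev = sum((x - mean) ** 2 for x in values)  # 1000.0
sd = math.sqrt(sum_sq_dev / n)                     # sqrt(200)
print(round(sd, 2))
```

This prints 14.14, agreeing with the hand calculation.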

Page 195

To work out the standard deviation for the grouped data using the example of the statistics

score we use the formula for the grouped data which is nearly the same as for the raw data

except you need to take into account the frequency with which each group score occurs.

To work out the standard deviation you continue using the same table as before. Look at the

headings on each column. It follows the formula. You need to practice this.

So the standard deviation for the statistics scores is: √(40768.75/80) = √509.61 = 22.57

Table 6.8
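The same grouped calculation can be sketched in Python, reusing the midpoints and frequencies of the statistics scores (variable names are illustrative):

```python
import math

# Grouped standard deviation: weight each squared deviation from the
# grouped mean by its class frequency, then divide by the total frequency.
midpoints   = [10, 25, 35, 45, 55, 65, 75, 85, 95]
frequencies = [ 8,  9, 11, 14, 11, 10,  9,  6,  2]

total_f = sum(frequencies)                                          # 80
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / total_f # 49.375
sum_f_sq_dev = sum(f * (x - mean) ** 2
                   for f, x in zip(frequencies, midpoints))         # 40768.75
sd = math.sqrt(sum_f_sq_dev / total_f)
print(round(sd, 2))
```

This prints 22.57, matching the result above.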

Page 196

The Median and the Quartiles.

The median is the figure where half the values of the data set lie below it and half
above. In a class of students the median age would be the age of the person where half the
class is younger than this person and half older. It is the age of the middle student.

If you had a class of 11 students, to find the median age, you would line up all the students
starting with the youngest to the oldest. You would then count up to the middle person, the
6th one along, ask them their age and that is the median age.

∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆

To find the median of raw data you need to firstly rank the figures from smallest to highest

and then choose the middle figure.

For grouped data it is not as easy to rank the data because you don’t have single figures you

have groups. There is a formula which can be used or the median can be found from the

ogive. From the ogive, you go to the half way point on the vertical axis (if this is already in

percentages then up to 50%) and then read the median off the horizontal axis.

If we use the data from the example of the statistics results we used before, you will

remember we drew the ogive from the following data:

Less than    Cumulative frequency    Percentage cumulative frequency
   20                 8                          10
   30                17                          21.25
   40                28                          35
   50                42                          52.5
   60                53                          66.25
   70                63                          78.75
   80                72                          90
   90                78                          97.5
  100                80                         100

Table 6.9

Page 197

[Ogive: percentage cumulative frequency (%) plotted against marks]

We can read the median off this and we can also read the quartiles. The median is read by

going up to 50% on the vertical axis and the reading the mark off the horizontal axis. In the

above example it is approximately 48 marks.

Using the formula we can also get the median. The formula is:

Median = Lm + ((N/2 − Fm-1) / fm) × cm

where Lm is the lower boundary of the median class, Fm-1 is the cumulative frequency
below that class, fm is its frequency and cm is its class width.

Figure 6.8

Page 198

To use the formula you take the data in its frequency distribution as follows:

Median = 40 + ((40 − 28)/14) × 10
       = 40 + (12/14) × 10
       = 40 + 8.57
       = 48.57
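The interpolation just performed is easy to mirror in Python (a minimal sketch; variable names are illustrative):

```python
# Median of grouped data via the interpolation formula:
#   median = L + ((N/2 - F) / f) * c
# where L is the lower boundary of the median class, F the cumulative
# frequency below it, f its frequency and c its class width.
L, F, f, c = 40, 28, 14, 10   # the 40-50 class of the statistics scores
N = 80                        # total frequency

median = L + ((N / 2 - F) / f) * c
print(round(median, 2))
```

This prints 48.57, agreeing with the hand calculation.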

The quartiles can also be found from the ogive, or by using a similar formula to that above.
Quartile 1 is the mark below which 25% of the class scored (33) and quartile 3 is the mark
below which 75% of the class scored (68). These can be read off the ogive at the 25% mark
and the 75% mark.

The interquartile range is found using the formula Q3 − Q1. This indicates the spread about
the median. The semi-interquartile range (which plays a similar role to the standard
deviation) is the interquartile range divided by 2.

Results                  Frequency
0 but less than 20           8
20 but less than 30          9
30 but less than 40         11    (cumulative 28)
40 but less than 50         14
50 but less than 60         11
60 but less than 70         10
70 but less than 80          9
80 but less than 90          6
90 but less than 100         2
Total                       80

Table 6.10
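The quartiles can also be computed with the same interpolation used for the median. The sketch below applies it to the statistics scores; the formula gives 32.7 and 67.0, close to the approximate ogive readings of 33 and 68 (the function name is illustrative):

```python
# Value below which `target` observations lie, within the class starting
# at lower boundary L, with cumulative frequency F below it, class
# frequency f and class width c.
def interpolate(target, L, F, f, c):
    return L + ((target - F) / f) * c

N = 80
q1 = interpolate(N / 4, 30, 17, 11, 10)      # 25% point: 30-40 class
q3 = interpolate(3 * N / 4, 60, 53, 10, 10)  # 75% point: 60-70 class

iqr = q3 - q1            # interquartile range
semi_iqr = iqr / 2       # semi-interquartile range (quartile deviation)
print(round(q1, 1), round(q3, 1), round(semi_iqr, 1))
```

Calculation gives slightly more precise answers than reading the ogive, just as the text notes for the median and deciles.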

Page 199

0

20

40

60

80

100

120

050 100 150

Percentage cumulative frequency.

Percentage cumulative

frequency.

Marks

%

For data which is normally distributed the median should lie half way between the two
quartiles; if the data is skewed to the right then the median will be closer to quartile 1. Why?

Percentiles are found in the same way as quartiles, the 10% percentile would be found by

going up 10% of the vertical axis, etc.

The Mode

There is no measure of dispersion associated with the mode.

The mode is the most frequently occurring figure in a data set. There is often no mode

particularly with continuous data or there could be a few modes. For raw data you find the

mode by looking at the data as before, or by doing a tally.

For grouped data you can estimate the mode from a histogram by finding the class with the

highest frequency and then estimating.

Figure 6.9

Page 200

Formula for the mode:

Mode = L + (D1 / (D1 + D2)) × C

To calculate the mode:

1) Determine the modal class, the class with the highest frequency

2) Find D1 = the difference between the largest frequency and the frequency immediately
preceding it.

3) Find D2 = the difference between the largest frequency and the frequency immediately
following it.

4) C is the modal class width and L is the lower boundary of the modal class.
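Applying these steps to the statistics scores, where the modal class is 40 but less than 50 (frequency 14, flanked by frequencies of 11 on each side), can be sketched as:

```python
# Mode of grouped data: Mode = L + (D1 / (D1 + D2)) * C
L = 40        # lower boundary of the modal class (40-50)
D1 = 14 - 11  # largest frequency minus the frequency preceding it
D2 = 14 - 11  # largest frequency minus the frequency following it
C = 10        # modal class width

mode = L + (D1 / (D1 + D2)) * C
print(mode)
```

This prints 45.0; with equal frequencies on either side, the estimated mode falls at the centre of the modal class.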

• Measures of dispersion- range, variance, standard deviation, co-efficient of

variation.

The range was explained earlier; it is found crudely by taking the highest figure in the data
set and subtracting the lowest figure.

The variance is very similar to the standard deviation and measures the spread of the data. If I

had two different classes and the mean result in both classes was the same, but the variance

was higher in class B then results in class B were more spread out. The variance is found by

getting the standard deviation and squaring it.

The standard deviation is done already.

The co-efficient of variation is used to establish which of two sets of data is relatively more

variable.

For example, take two companies ABC and CBA. You are given the following information

about their share price and the standard deviation of share price over the past year.

Page 201

         Mean    Standard deviation    Coefficient of variation (CV)
ABC       1.2           0.8                       0.67
CBA       1.6           0.9                       0.56

CV = Standard Deviation / Mean

So CBA shares are relatively less variable.
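The comparison can be sketched in a couple of lines of Python (the function name is illustrative):

```python
# Coefficient of variation: the standard deviation as a fraction of the mean.
def cv(sd, mean):
    return sd / mean

cv_abc = cv(0.8, 1.2)   # about 0.67
cv_cba = cv(0.9, 1.6)   # 0.5625, about 0.56

print(round(cv_abc, 2), round(cv_cba, 2))
# cv_cba < cv_abc, so CBA shares are relatively less variable
```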

The Harmonic mean: The harmonic mean is used in particular circumstances, namely when
the data consists of a set of rates such as prices, speeds or productivity.

The formula for this is:

Harmonic mean = n / ∑(1/x)

The Geometric mean: This is used to average proportional increases.

An example will illustrate the use of this and the application of the formula:

It is known that the price of a product has increased by 5%, 2%, 11% and 15% in four
successive years.

The GM is:

GM = ⁴√(1.05 × 1.02 × 1.11 × 1.15) = ⁴√1.367 = 1.081

so the average increase is about 8.1% per year.

Table 6.11
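Both special means can be sketched in Python. The geometric-mean figures come from the example above; the two speeds used for the harmonic mean are invented purely for illustration:

```python
import math

# Harmonic mean: n / sum(1/x), for averaging rates.
# The speeds below are hypothetical, not from the text.
speeds = [40, 60]
harmonic = len(speeds) / sum(1 / x for x in speeds)  # 48.0

# Geometric mean of the four price relatives from the example:
# the 4th root of their product.
relatives = [1.05, 1.02, 1.11, 1.15]
geometric = math.prod(relatives) ** (1 / len(relatives))

print(harmonic, round(geometric, 3))
```

The geometric mean comes out at about 1.081, matching the hand calculation; note that the harmonic mean of 40 and 60 is 48, not the arithmetic 50, which is why it is the right average for rates.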

Page 202

Dispersion and Skewness:

The normal distribution is used frequently in statistics. It is not skewed and the mean, median

and the mode will all have the same value. So for normally distributed data it does not matter

which measure of average you use as they are all the same.

Data which is skewed looks like this:

Figure 6.10

Figure 6.11

Page 203

STUDY UNIT 7

The Normal Distribution

Contents

Unit Title                                                      Page

A.  Introduction                                                 205
B.  The Normal Distribution                                      207
C.  Calculations Using Tables of the Normal Distribution         209
    Tables of the Normal Distribution                            209
    Using the Symmetry of the Normal Distribution                211
    Further Probability Calculations                             212
    Example                                                      213

Page 204

BLANK

Page 205

A. INTRODUCTION

In Study Unit 4, Section E of this module, we considered various graphical ways of

representing a frequency distribution. We considered a frequency dot diagram, a bar chart, a

polygon and a frequency histogram. For a typical histogram, see Figure 7.1. You will

immediately get the impression from this diagram that the values in the centre are much more

likely to occur than those at either extreme.

Consider now a continuous variable in which you have been able to make a very large

number of observations. You could compile a frequency distribution and then draw a

frequency bar chart with a very large number of bars, or a histogram with a very large

number of narrow groups. Your diagrams might look something like those in Figure 7.2.

Figure 7.1

Figure 7.2

Page 206

If you now imagine that these diagrams relate to relative frequency distribution and that a

smooth curve is drawn through the tops of the bars or rectangles, you will arrive at the idea of

a frequency curve.

Most of the distributions which we get in practice can be thought of as approximations to

distributions which we would get if we could go on and get an infinite total frequency;

similarly, frequency bar charts and histograms are approximations to the frequency curves

which we would get if we had a sufficiently large total frequency. In this course, from now

onwards, when we wish to illustrate frequency distributions without giving actual figures, we

will do so by drawing the frequency curve, as in Figure 7.3.

Figure 7.3

Page 207

B. THE NORMAL DISTRIBUTION

The "Normal" or "Gaussian" distribution is probably the most important distribution in the

whole of statistical theory. It was discovered in the early 18th century, because it seemed to

represent accurately the random variation shown by natural phenomena. For example:

− heights of adult men from one race

− weights of a species of animals

− the distribution of IQ levels in children of a certain age

− weights of items packaged by a particular packing machine

− life expectancy of light bulbs

A typical shape is shown in Figure 7.4. You will see that it has a central peak (i.e. it is

unimodal) and that it is symmetrical about this centre.

The mean of this distribution is shown as m on the diagram, and is located at the centre. The
standard deviation, which is usually denoted by σ, is also shown.

There are some interesting properties which these curves exhibit, which allow us to carry out

calculations on them. For distributions of this approximate shape, we find that about 68% of
the observations are within ±1 standard deviation of the mean, and about 95% are within ±2
standard deviations of the mean. For the normal distribution, these figures are 68.27% and
95.45% respectively. See Figure 7.5.

Figure 7.4

Page 208

These figures can be expressed as probabilities. For example, if an observation x comes from

a normal distribution with mean m and standard deviation s, the probability that x is between

(m – s) and (m + s) is:

P(m – σ < x < m + σ) = 0.68

Also P(m – 2σ < x < m + 2σ) = 0.95

Figure 7.5

Page 209

C. CALCULATIONS USING TABLES OF THE

NORMAL DISTRIBUTION

Tables of the Normal Distribution

Tables exist which allow you to calculate the probability of an observation being within any
range, not just (m – σ) to (m + σ) and (m – 2σ) to (m + 2σ). We show here a set of tables

giving the proportion of the area under various parts of the curve of a normal distribution.

Table 7.1

Page 210

The figure given in the tables is the proportion of the area in one tail of the distribution. The
area under a section of the curve represents the proportion of observations of that size. For
example, the shaded area shown in Figure 7.6 represents the chance of an observation being
greater than m + 2σ. The vertical line which defines this area is at m + 2σ. Looking up the
value 2 in the table gives:

P(x > m + 2σ) = 0.02275

which is just over 2%.

Similarly, P(x > m + 1σ) is found by looking up the value 1 in the tables. This gives:

P(x > m + 1σ) = 0.1587

which is nearly 16%.

You can extract any value from P(x > m) to P(x > m + 3σ) from the tables. This means that

you can find the area in the tail of the normal distribution wherever the vertical line is drawn

on the diagram.

Figure 7.6
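The tail figures quoted above can be reproduced without tables, using the standard normal cumulative probability expressed through the error function in Python's standard library (a sketch; the function name phi is illustrative):

```python
import math

# Standard normal cumulative probability:
#   Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

tail_1sd = 1 - phi(1)   # P(x > m + 1*sigma), about 0.1587
tail_2sd = 1 - phi(2)   # P(x > m + 2*sigma), about 0.02275

print(round(tail_1sd, 4), round(tail_2sd, 5))
```

The printed values, 0.1587 and 0.02275, are exactly the figures looked up in the table above.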

Page 211

Using the Symmetry of the Normal Distribution

Negative distances from the mean are not shown in the tables. Since the distribution is

symmetrical, it is easy to calculate these.

P(x < m – 5σ) = P(x > m + 5σ)

So P(x > m – 5σ) = 1 – P(x < m – 5σ)

This is illustrated in Figure 7.7.

Figure 7.7

Page 212

Further Probability Calculations

It is possible to calculate the probability of an observation being in the shaded area shown in

Figure 7.8, using values from the tables. This represents the probability that x is between m –

0.7σ and m + 1.5σ