For this simple procedure, we have a minimum of five sources of error.
Each can occur as a + or - error.
Our total error becomes:
E_T = ±E_wt1 ± E_vol ± E_drain ± E_wt2 ± E_density
And this was for a simple example.
Identifying sources of error can help you reduce some sources.
You can never eliminate all sources of error.
The sources will be random in nature.
We must rely on statistical treatment of our data to account for these errors.
Probability - a characteristic associated with an event: its tendency to take place.
To see what we’re talking about, let’s use an example - a 10 sided die.
If many people each had an identical die, and each gave it a roll, what would be the expected result?
For a single roll, each value is equally likely to come up.
What if each person had two dice?
[Histogram: outcomes 0-9 for a single roll, each with frequency 1/10]
[Histogram: average of two dice - one roll, outcomes 0-9]
We can continue this trend, using more dice and a single roll.
[Histograms: average of three dice and of four dice, outcomes 0-9]
As the number of dice approaches infinity, the values become continuous.
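This convergence is easy to see by simulation; a minimal sketch in Python, assuming a fair ten-sided die with faces 0-9 (function name and trial count are my own):

```python
import random
import statistics

def average_rolls(n_dice, n_trials=20000, seed=1):
    """Average of n_dice ten-sided dice (faces 0-9), repeated n_trials times."""
    rng = random.Random(seed)
    return [statistics.mean(rng.randint(0, 9) for _ in range(n_dice))
            for _ in range(n_trials)]

# The spread of the averages shrinks as more dice are rolled per trial.
spread_1 = statistics.stdev(average_rolls(1))
spread_4 = statistics.stdev(average_rolls(4))
```

With four dice the spread of the averages is roughly half that of a single die (σ/√n), and the histogram of the averages piles up around the center.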
The curve becomes a normal or Gaussian distribution. The distribution can be described by:

f(x) = [ 1 / √(2πσ²) ] exp[ -(x - µ)² / 2σ² ],   -∞ < x < +∞

σ - standard deviation
µ - population (universal) mean
An infinite data set is required.
These terms can be calculated by:

µ = ( Σ xi ) / N,  i = 1 to N
σ² = variance = ( Σ (xi - µ)² ) / N,  i = 1 to N
σ = √σ²
These are for infinite data sets but can be used for data sets where N > 100 and any variation is truly random in nature.
[Normal curve marked in σ units from -3σ to +3σ; example scale: 2 mg, 5 mg, 8 mg, 11 mg, 14 mg, 17 mg, 20 mg]
When determining σ, you are assuming a normal distribution and relating your data to σ units.
The area under any portion of the curve tells you the probability of an event occurring:
± 1σ = 68.3% of the data
± 2σ = 95.5%
± 3σ = 99.7%
We can use the normal distribution curve to predict the likelihood of an event occurring.
This approach is only valid for large data sets and is useful for things like quality control of mass-produced products.
In the following examples, we will assume that there is a very large data set and that µ and ! have been tracked.
Assuming you know µ and ! for a dataset, you can calculate u (the reduced variable) as:
u = ( x - µ ) / σ
This is simply converting your test value from your normal units (mg, hours, ...) to standard deviation units.
Assuming that your data is normally distributed, you can use u to predict the probability of an event occurring.
The probability can be found by looking up u on a table - Form A and Form B
Which form you use is based on the question to be asked.
This form will give you the area under the curve from u to ∞:

|u|   area     |u|    area
0.0   0.5000   2.0    0.0227
0.2   0.4207   2.2    0.0139
0.4   0.3446   2.4    0.0082
0.6   0.2743   2.6    0.0047
0.8   0.2119   2.8    0.0026
1.0   0.1587   3.0    1.3 × 10⁻³
1.2   0.1151   4.0    3.2 × 10⁻⁵
1.4   0.0808   6.0    9.9 × 10⁻¹⁰
1.6   0.0548   8.0    6.2 × 10⁻¹⁶
1.8   0.0359   10.0   7.6 × 10⁻²⁴
This form will give you the area under the curve from 0 to u
|u|   area     |u|   area
0.0   0.0000   1.6   0.4452
0.2   0.0793   1.8   0.4641
0.4   0.1554   2.0   0.4772
0.6   0.2258   2.2   0.4861
0.8   0.2881   2.4   0.4918
1.0   0.3413   2.6   0.4953
1.2   0.3849   2.8   0.4974
1.4   0.4192   3.0   0.4987
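Both table forms can be reproduced from the complementary error function; a minimal sketch in Python (the function names `tail_area` and `central_area` are my own):

```python
import math

def tail_area(u):
    """Form A: area under the standard normal curve from |u| to infinity."""
    return 0.5 * math.erfc(abs(u) / math.sqrt(2.0))

def central_area(u):
    """Form B: area under the standard normal curve from 0 to |u|."""
    return 0.5 - tail_area(u)
```

For example, tail_area(1.0) gives 0.1587 and central_area(2.0) gives 0.4772, matching the table entries.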
A tire is produced with the following statistics regarding usable mileage.
µ = 58000 miles, σ = 10000 miles
What mileage should be guaranteed so that less than 5% of the tires would need to be
replaced?
Looking on Form A, we find that an area of 0.05 comes closest to u = 1.6.
Now use the reduced variable equation with u = -1.6 (we want a value below the mean):
-1.6 = ( x - 58000 ) / 10000
x = 42000 miles
You install a pH electrode to monitor a process stream. The manufacturer provided you with the following values regarding electrode lifetime:
µ = 8000 hours
σ = 200 hours
If you needed to replace the electrode after 7200 hours of use, was the electrode ‘bad’?
u = (7200 - 8000) / 200 = -4.0
Looking on Form A, we find that the probability at a value of 4 is 3.2 × 10⁻⁵. That means that only 0.0032% of all electrodes would fail at 7200 hours or earlier. You have a bad electrode.
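Both worked examples reduce to the same two steps - convert with u, then read the tail area. A sketch in Python (the helper names are my own):

```python
import math

def reduced_variable(x, mu, sigma):
    """u = (x - mu) / sigma."""
    return (x - mu) / sigma

def tail_probability(u):
    """Area under the normal curve beyond |u| (the Form A lookup)."""
    return 0.5 * math.erfc(abs(u) / math.sqrt(2.0))

# Tire example: from Form A, a 5% tail corresponds to u ~ 1.6;
# solve u = (x - mu)/sigma for x with u = -1.6 (below the mean).
guaranteed = 58000 + (-1.6) * 10000        # 42000 miles

# Electrode example: mu = 8000 h, sigma = 200 h, failure at 7200 h.
u = reduced_variable(7200, 8000, 200)      # -4.0
p_early = tail_probability(u)              # ~3.2e-5, so the electrode was 'bad'
```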
For smaller data sets, we use the following:

mean = x̄ = ( Σ xi ) / n,  i = 1 to n
s² = ( Σ ( xi - x̄ )² ) / (n - 1),  i = 1 to n
When we use smaller data sets, we must be concerned with:
1. Are the samples representative of the population? The values must be truly random.
2. If we pick a non-random set, like all men or all women, are the differences significant?
This example shows what can happen if a bias is introduced during sampling.
Red - random sampling Blue - bias in sampling
In this example, a bias is introduced simply by separately evaluating males and females.
[Figure: distributions for males, females, and the total population, illustrating detection limit bias, nominal value bias, and outliers]
Introducing a bias is not always a bad thing.
The bias could be due to a true difference between the populations
It also could be due to poor sampling of a population.
We need tools to tell the difference.
Mean, median and mode: ideally, all three values should be the same. If not the same, they should at least be very close.
This is a test to see if a population is Gaussian.
g = ( Σ ( xi - x̄ )³ ) / ( N sx³ ),  i = 1 to N

g = 0: symmetric (Gaussian)
g < 0: skewed toward low values
g > 0: skewed toward high values
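The skewness statistic can be computed directly; a minimal sketch in Python, using the sample standard deviation for sx (function name is my own):

```python
def skewness(data):
    """g = sum((xi - mean)^3) / (N * s^3); g = 0 for a symmetric data set."""
    n = len(data)
    mean = sum(data) / n
    s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return sum((x - mean) ** 3 for x in data) / (n * s ** 3)
```

A symmetric data set gives g near 0; a single high outlier drags g positive.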
For smaller data sets, we use the following:

variance = s² = ( Σ ( xi - x̄ )² ) / (n - 1),  i = 1 to n
standard deviation = s = √s²
degrees of freedom = df = n - 1 (typically)
When doing a linear regression fit of a line using X,Y data pairs, the model used (Y = mX + b) results in two parameters.
The degrees of freedom would be N-2 in this case (or model).
So the model used determines the degrees of freedom.
In many cases it is necessary to combine results:
- values from separate labs
- data collected on separate days
- a different instrument was used
- a different method of analysis was used
When we combine the data, we refer to this as 'pooling' the data.
We can’t simply combine all of the values and calculate the mean and other statistical values.
There may have been differences with the results obtained.
There might have been different numbers of samples collected with each set.
We also would like a way to tell if the results are significantly different.
So the pooled standard deviation is weighted for the degrees of freedom. This accounts for the number of data points and any parameters used in obtaining the results.
sp = √[ ( s1²·df1 + s2²·df2 + ... + sk²·dfk ) / ( df1 + df2 + ... + dfk ) ]
Set   s     n   df   s²    df × s²
1     1.35  10   9   1.82  16.4
2     2.00   7   6   4.00  24.0
3     2.45   6   5   6.00  30.0
4     1.55  12  11   2.40  26.4

sp = √[ ( 16.4 + 24.0 + 30.0 + 26.4 ) / ( 9 + 6 + 5 + 11 ) ] = 1.77
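The pooled standard deviation calculation can be sketched in Python; the `(s, n)` pair interface is my own:

```python
def pooled_sd(groups):
    """groups: (s, n) pairs; each variance is weighted by its df = n - 1."""
    num = sum(s ** 2 * (n - 1) for s, n in groups)
    den = sum(n - 1 for s, n in groups)
    return (num / den) ** 0.5

# The four sets from the table above:
sp = pooled_sd([(1.35, 10), (2.00, 7), (2.45, 6), (1.55, 12)])  # ~1.77
```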
So far, we've assumed that all observed variance comes from a single, random source.
- not likely
- there can be many sources of variance
We'll now analyze the variance in sample sets.
In general, when sources of variance are linearly related (independent and not correlated), the variances are additive.
s²total = s1² + s2² + ... + sk²

We often need to do experiments to evaluate the magnitude and sources of variance.
Sample   Replicates         Deviation        Mean
1        15.8, 16.2, 16.4   0.2, 0.0, -0.2   16.1
2        14.9, 15.1, 15.3   0.4, 0.0, -0.4   15.1
3        15.8, 15.8, 15.8   0.0, 0.0, 0.0    15.8
4        16.2, 16.0, 15.9   0.2, 0.0, -0.2   16.0

This data set is of the two-level type.
- sample - level one
- replicate - level two
[Diagram: samples X1-X4, each with three replicates x1'-x4']
Level 1 gives us an idea as to method variability; Level 2 tells us about the sample variability.
First, calculate sT²:
sT² = ( Σ ( xi - x̄T )² ) / ( Nsamples - 1 ) = 0.381

Next, calculate sM²:
sM² = ( Σ d² ) / ( Nmeasurements - Nsamples ) = 0.218

Finally:
sa² = sT² - sM² = 0.163
[Diagram: a three-level design - samples A1-A4, replicates B1-B3 for each sample, and measurements C1-C3 for each replicate]
When adding more levels, things rapidly become more complex.
Confidence intervals:
- tell you where most of your data should occur
- a common calculation to report the variability of data
- a quick way of identifying outlying values
For large data sets
CI = µ ± Z·σ/√N,  Z = probability factor

CI (2-sided)   Z
90%            1.645
95%            1.960
99%            2.575
99.99995%      5.000
This approach assumes that N is a very large value.
Z comes from the infinity row of the t table.
The term σ/√N is the standard deviation of the mean.
t values account for error introduced based on sample size, degrees of freedom and potential sample skew.
Actually, the χ² (chi-squared) distribution is used:

χ²(n-1) = (n - 1) s² / σ²

This allows us to estimate the population variance from the sample variance.
All of this is tied together into the t values.
Degrees of    Confidence level
Freedom       90%    95%    99%
1 6.31 12.7 63.7
2 2.92 4.30 9.92
3 2.35 3.18 5.84
4 2.13 2.78 4.60
5 2.02 2.57 4.03
6 1.94 2.45 3.71
7 1.90 2.36 3.50
8 1.86 2.31 3.36
9 1.83 2.26 3.25
10 1.81 2.23 3.17
Data: 1.01, 1.02, 1.10, 0.95, 1.00
mean = 1.016
sx = 0.0541
sx̄ = 0.0242

t values for 4 degrees of freedom: 90% confidence = 2.13, 95% confidence = 2.78

CI90% = 1.016 ± 2.13 × 0.0541 / √5 = 1.02 ± 0.05 (± 5%)
CI95% = 1.016 ± 2.78 × 0.0541 / √5 = 1.02 ± 0.07 (± 7%)
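The confidence interval calculation can be sketched in Python; the t value is taken from the t table for df = n - 1 (function name is my own):

```python
import statistics

def confidence_interval(data, t):
    """Return (mean, half-width) for CI = mean +/- t * s / sqrt(n)."""
    n = len(data)
    return statistics.mean(data), t * statistics.stdev(data) / n ** 0.5

data = [1.01, 1.02, 1.10, 0.95, 1.00]
mean, half95 = confidence_interval(data, 2.78)   # t for 95%, df = 4
```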
If you have two sets of numbers:
- from different samples
- from different assays of the same sample
Are they actually different?
If the means are identical, it is more likely an accident than anything else.
We can test to see if the results (identical or not) are the same or not.
To test if the two means actually differ, first you must calculate the mean and standard deviation for each sample.
Next you must evaluate based on two possible cases:
Case 1 - A and B do not differ significantly Case 2 - A and B do differ significantly.
Which assumption you make affects how you approach the problem.
Case 1 - A and B do not differ significantly:
1. Calculate the pooled standard deviation, sp.
2. Calculate the variances of the means: VA = sp² / nA, VB = sp² / nB.
3. Pick t based on the desired confidence level and degrees of freedom: df = nA + nB - 2.
4. Calculate the uncertainty of the difference between the two means:
   VΔ = t √( VA + VB )
If | x̄A - x̄B | > VΔ, then the means differ.

Case 2 - A and B do differ significantly:
1. Calculate the variances of the means: VA = sA² / nA, VB = sB² / nB.
2. Pick t based on the desired confidence level and degrees of freedom:
   df = ( VA + VB )² / [ VA² / (nA - 1) + VB² / (nB - 1) ]
3. Calculate the uncertainty of the difference between the two means:
   VΔ = t √( VA + VB )
If | x̄A - x̄B | > VΔ, then the means differ.
Both methods should give essentially the same results.
Let's try an example:
A - mean = 50 mg/l, s = 2.0 mg/l, n = 5
B - mean = 45 mg/l, s = 1.5 mg/l, n = 6
| x̄A - x̄B | = 5 mg/l = Δx̄

Case 1:
1. sp = √[ ( 2² × 4 + 1.5² × 5 ) / ( 4 + 5 ) ] = 1.74
2. VA = 1.74² / 5 = 0.6056; VB = 1.74² / 6 = 0.5047
3. Use the 95% confidence limit (t = 2.262 for df = 9)
4. VΔ = 2.262 √( 0.6056 + 0.5047 ) = 2.38

Δx̄ > VΔ (5 > 2.38), so the means are different at the 95% confidence level.

Case 2:
1. VA = 2² / 5 = 0.800; VB = 1.5² / 6 = 0.375
2. Again, use the 95% confidence limits
3. df = ( 0.800 + 0.375 )² / [ 0.800² / 4 + 0.375² / 5 ] = 7.34 ≈ 7
4. VΔ = 2.365 √( 0.800 + 0.375 ) = 2.6

Δx̄ > VΔ, so we again find that the means are different.
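Both cases can be sketched in Python; the function names and summary-statistics interface are my own:

```python
import math

def case1_pooled(ma, sa, na, mb, sb, nb, t):
    """Case 1: pool the variances; caller picks t for df = na + nb - 2."""
    sp2 = (sa ** 2 * (na - 1) + sb ** 2 * (nb - 1)) / (na + nb - 2)
    v_delta = t * math.sqrt(sp2 / na + sp2 / nb)
    return abs(ma - mb) > v_delta, v_delta

def case2_separate(ma, sa, na, mb, sb, nb, t):
    """Case 2: keep the variances separate; also return the effective df."""
    va, vb = sa ** 2 / na, sb ** 2 / nb
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    v_delta = t * math.sqrt(va + vb)
    return abs(ma - mb) > v_delta, v_delta, df

differ1, vd1 = case1_pooled(50, 2.0, 5, 45, 1.5, 6, t=2.262)
differ2, vd2, df = case2_separate(50, 2.0, 5, 45, 1.5, 6, t=2.365)
```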
You can have samples that are considered significantly different and still have the same mean.
In both examples, the populations would be considered to be different, even though the mean, median and mode are identical in the example on the right.
This test can be used to tell if two populations are different based on changes in variance.
Examples
- Has the measurement precision changed?
- Has the method been altered?
- Were there any significant changes due to the lab or analyst?
Calculation of F:

F = s²(larger) / s²(smaller)

F is always 1 or greater and depends on the confidence level and degrees of freedom.
You can look up the Fc value for the desired levels.
Let's return to the previous example:
A - mean = 50 mg/l, s = 2.0 mg/l, n = 5
B - mean = 45 mg/l, s = 1.5 mg/l, n = 6
F = 2² / 1.5² = 1.78
Fc is 7.39 at 95% confidence (df = 4 and 5).
So, the variance values are essentially the same and the means must really differ.
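The F calculation itself is one line; a sketch in Python (function name is my own):

```python
def f_ratio(s1, s2):
    """F = larger variance over smaller variance, so F >= 1 always."""
    v1, v2 = s1 ** 2, s2 ** 2
    return max(v1, v2) / min(v1, v2)

F = f_ratio(2.0, 1.5)   # ~1.78 for the example above, well below Fc = 7.39
```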
Sometimes we know that a data point looks bad (an outlier). We can't just pitch it out; there must be a basis for rejecting data.
Outliers
- Rule of the huge error
- Dixon test (Q test)
- Grubbs test
Each has its own advantages / disadvantages.
Assumes that you have some idea as to what the standard deviation should be, or can calculate it.

M = | suspect - mean | / s

If M > 4 then you can reject the point.
This is simply a crude t test. It is only useful for discarding obviously 'bad' data.
With the data ranked x1 ≤ x2 ≤ ... ≤ xn, there are two versions of each ratio based on whether it is the lowest or the highest value being tested.

Use ratio r10 for 3-7 data points:
r10 = ( xn - xn-1 ) / ( xn - x1 )   to test the high value
r10 = ( x2 - x1 ) / ( xn - x1 )     to test the low value

For 8-10 points, use r11:
r11 = ( xn - xn-1 ) / ( xn - x2 )   high
r11 = ( x2 - x1 ) / ( xn-1 - x1 )   low

For 11-13 points, use r21:
r21 = ( xn - xn-2 ) / ( xn - x2 )   high
r21 = ( x3 - x1 ) / ( xn-1 - x1 )   low

For 14-25 points, use r22:
r22 = ( xn - xn-2 ) / ( xn - x3 )   high
r22 = ( x3 - x1 ) / ( xn-2 - x1 )   low
Risk of false rejection:

Statistic   n    0.5%   1%     5%     10%
r10         3    .994   .988   .941   .886
            4    .926   .889   .765   .679
            5    .821   .780   .642   .557
            6    .740   .698   .560   .482
            7    .680   .637   .507   .434
r11         8    .725   .683
            9    .677   .635
            10   .639   .679
r21         11   .713   .642
            12   .675   .615
            13   .649
r22         14   .674
            15   .647
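The r10 ratio (3-7 points) can be sketched in Python; the ratios for larger n follow the same pattern (function name is my own):

```python
def dixon_r10(data):
    """Return (low_ratio, high_ratio) for testing the extremes of 3-7 points."""
    x = sorted(data)
    spread = x[-1] - x[0]
    return (x[1] - x[0]) / spread, (x[-1] - x[-2]) / spread

low, high = dixon_r10([0.95, 1.00, 1.01, 1.02, 1.10])
# high = 0.533 < 0.642 (n = 5, 5% risk), so 1.10 is retained
```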
This approach requires calculation of the mean and standard deviation
1. Rank the points.
2. Pick the suspect points.
3. Calculate the mean and standard deviation using all points.
4. Calculate T: T = | mean - suspect | / sx
5. Look up T on the table. If T > the table value, then reject the point.
Risk of false rejection:

n    0.1%    1%      5%
3    1.155   1.155   1.153
4    1.496   1.492   1.463
5    1.780   1.749   1.672
6    2.011   1.944   1.822
7    2.201   2.097   1.938
8    2.358   2.221   2.032
9    2.492   2.323   2.110
10   2.606   2.410   2.176
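The Grubbs T statistic, as a sketch in Python (function name is my own):

```python
import statistics

def grubbs_t(data, suspect):
    """T = |mean - suspect| / s, computed with the suspect point included."""
    return abs(statistics.mean(data) - suspect) / statistics.stdev(data)

T = grubbs_t([1.01, 1.02, 1.10, 0.95, 1.00], suspect=1.10)
# T ~ 1.55 < 1.672 (n = 5, 5% risk), so the point is retained
```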