
2. THEORY & METHODS

2.2. Methodological Framework

2.2.6. Statistical Treatment of the Data

Once the collected data had been ordered and described, I proceeded with the application of statistical tests to assess their significance and consistency. I selected a few well-known techniques for which there is abundant theoretical and empirical literature, as well as several examples of archaeological application. All the chosen statistics have been applied systematically to all the studied sites, with only slight variations depending on the specifics of each lithic assemblage. Significance tests have been carried out mainly at an intra-site level, while multivariate and cluster analyses have been carried out at an inter-site level, to compare data between the various contexts.

All the operations and tests have been carried out with the software IBM SPSS Statistics v. 21.

The analysis and interpretation of the data have been based mainly on the IBM SPSS Statistics Guide to Data Analysis (Norusis 2011) and on Pérez's SPSS manual for multivariate analysis (2004), while the review of the various statistical techniques and of their application from an archaeological perspective draws on Shennan (1992), Baxter (2003) and Drennan (2009).

i. Standard deviation, variance, mean, median: basic statistics are used to evaluate the distribution of a given variable, especially metric variables such as length, width and thickness. For example, the average measurements of each class of tools (e.g. sickle blades, hide scrapers on flake, etc.) have been estimated using the median as the measure of central tendency and the variance as the measure of dispersion.
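Although all the computations in this work were carried out in SPSS, the descriptive statistics above can be illustrated with a short Python sketch; the tool measurements below are invented for the example.

```python
# Minimal sketch: descriptive statistics for a metric variable, e.g. the
# lengths (in mm) of a hypothetical sample of sickle blades. Values invented.
import numpy as np

lengths = np.array([42.1, 45.3, 39.8, 44.0, 41.5, 47.2, 40.9])

print("mean:     ", np.mean(lengths))
print("median:   ", np.median(lengths))       # central tendency, robust to outliers
print("variance: ", np.var(lengths, ddof=1))  # sample variance (dispersion)
print("std dev.: ", np.std(lengths, ddof=1))  # sample standard deviation
```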

ii. χ²: Pearson's chi-square test is a statistic used to determine whether there is a relationship between two categorical variables. Pearson's test is only one of the many types of chi-squared test (e.g. Yates, likelihood-ratio, etc.), which can be broadly defined as statistical procedures whose results are evaluated by reference to the chi-squared distribution. In practice, given a crosstab of two variables, the chi-square test provides a method for testing the association between the row and column variables.

The null hypothesis H0 assumes that there is no association between the variables (in other words, one variable does not vary according to the other), while the alternative hypothesis Ha claims that some association does exist. One of the 'limits' of the chi-square test is that the alternative hypothesis (Ha) specifies neither the type of association nor its intensity, but only indicates a probability that the association exists. Moreover, the chi-square test is affected by the size of the sample employed. One of the test's basic requirements is a minimum expected cell count of 5 (in at least 80% of the cells); this constraint is not always satisfied by my data, as I often have to deal with contingency tables with sparsely populated cells. This limit can be circumvented by grouping variables into larger categories containing a greater number of items. For sufficiently large samples, the chi-square test represents a quick method to analyse the distribution of two variables and explore the existence of significant associations. I mainly employed this test to explore the distribution of data from the technological and raw-material analyses among a set of archaeological phases, as well as their mutual relationship.
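As a minimal illustration (not the actual SPSS workflow used in this work), a Pearson chi-square test on a hypothetical crosstab of raw-material types by phase can be run as follows; the counts are invented for the example.

```python
# Minimal sketch: Pearson chi-square test on an invented contingency table
# (rows: raw-material types; columns: archaeological phases).
import numpy as np
from scipy.stats import chi2_contingency

crosstab = np.array([[30, 12,  8],    # e.g. flint
                     [10, 25, 15],    # e.g. obsidian
                     [ 5,  9, 20]])   # e.g. limestone

chi2, p, dof, expected = chi2_contingency(crosstab)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")

# Check the requirement of a minimum expected count of 5 in >= 80% of cells
ok = (expected >= 5).mean() >= 0.8
print("expected-count requirement satisfied:", ok)
```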

iii. Correspondence Analysis (CA): Correspondence Analysis is a statistical technique that allows one to examine the relationship between two nominal variables in a multidimensional space.

It computes scores from the row and column frequencies and produces plots based on those scores.

Categories that are similar to each other appear close together in the plot. In this way, it is easy to see which categories of a variable are similar to each other, or which categories of the two variables are related. The distance between the variables is computed using the chi-square distance, so one of the requirements of this technique is that the analysed data satisfy the chi-square requirements (expected values > 5). As far as the normalization procedure is concerned, several methods exist; in my case I employed the default method proposed by the SPSS software: symmetric normalization, which is the most useful method when examining differences or similarities between rows and columns. I applied this statistic to the analysis of the raw-material distribution among the different occupational phases studied in this work, with the objective of highlighting associations between sites and lithologies.
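For illustration only, the core of a Correspondence Analysis can be computed directly via a singular value decomposition, rather than through SPSS; the sketch below assumes an invented site-by-lithology table and splits the singular values equally between row and column scores, in the spirit of symmetric normalization.

```python
# Minimal sketch: Correspondence Analysis of an invented contingency table,
# computed from standardized residuals via SVD.
import numpy as np

N = np.array([[30, 12,  8],
              [10, 25, 15],
              [ 5,  9, 20]], dtype=float)   # hypothetical site x lithology counts

P = N / N.sum()        # correspondence matrix
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses

# Standardized residuals: S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Symmetric scores: singular values split equally between rows and columns
row_scores = (U * np.sqrt(sv)) / np.sqrt(r)[:, None]
col_scores = (Vt.T * np.sqrt(sv)) / np.sqrt(c)[:, None]
print("row scores (dims 1-2):\n", row_scores[:, :2])
print("column scores (dims 1-2):\n", col_scores[:, :2])
```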

iv. Hierarchical Cluster Analysis (HCA): hierarchical clustering is an exploratory tool designed to reveal natural groupings (or clusters) within a data set. It is one of the most common approaches in archaeology, as it allows one to categorize data, to separate them and thus to order them. The idea behind the technique is that cases that are strongly similar to each other, in terms of their values for a number of variables, wind up in the same groups, while cases that are more different from each other end up in different clusters. In this sense (seeking structure in the relationships among cases characterized by a number of variables) HCA is quite similar to CA. However, one of the main advantages of cluster analysis is that cells with a value of zero are also considered valid data and are represented in the diagram. Indeed, I employed cluster analysis to compare the type and frequency of the economic activities carried out at each site/phase; the absence of a certain economic activity is thus treated as relevant information rather than as a null value. I applied different clustering methods and measures depending on the data considered in the analysis. At first, starting with a reduced sample, I applied single-linkage clustering (or nearest neighbour). With this method the distance between two clusters is defined as the smallest distance between two cases in different clusters; that is, at every step, the distance between two clusters is taken to be the distance between their two closest members. Single linkage represents the simplest approach, but it is often very useful for identifying the presence of outliers (an outlier being an observation that lies at an abnormal distance from the other values in a sample).

In fact, since the single-linkage algorithm is based on minimum distances, it tends to form a few large clusters, with the remaining clusters containing only one or a few objects each, while other methods tend to avoid single-element clusters. Expecting the existence of single-element clusters, I applied this method in the first test. In the second step, having enlarged the sample with new data, I ran Ward's procedure to confirm the previous classification and to test the number of clusters. In this approach the means of all variables are computed and then, for each case, the squared Euclidean distance to the cluster means is calculated. At each step, the two clusters that merge are those that result in the smallest increase in the overall sum of the squared within-cluster distances. Expecting somewhat equally sized clusters at this stage, with the outliers absorbed by the addition of new data, I opted for Ward's method. After Ward's method, the last step of the so-called 'three-step clustering' procedure is the k-means method.

This test serves to assess the stability of the clusters obtained previously. The number of clusters (k) is indeed provided as an input parameter at the beginning of the analysis. The objective of the k-means step is to verify whether the cluster centres change, with a re-assignment of objects between the various groups. If the initial partitioning of the objects in the first step of the k-means procedure is retained, it means that it was not possible to further reduce the overall within-cluster variation.

This result provides evidence of the clusters' stability and reliability.
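The 'three-step clustering' procedure described above can be sketched, purely for illustration, with scipy and scikit-learn in place of SPSS; the site-by-activity frequency matrix below is randomly generated, and the number of clusters (3) is an arbitrary choice for the example.

```python
# Minimal sketch of the three-step clustering procedure:
# single linkage -> Ward -> k-means. Data invented (12 sites, 5 activities).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.poisson(3, size=(12, 5)).astype(float)  # zeros are valid observations

# Step 1: single linkage (nearest neighbour), useful for spotting outliers
single = linkage(X, method='single', metric='euclidean')

# Step 2: Ward's method, to fix the classification and the number of clusters
ward = linkage(X, method='ward')
labels_ward = fcluster(ward, t=3, criterion='maxclust')  # e.g. 3 clusters

# Step 3: k-means with k taken from Ward's solution; if the partition is
# (nearly) unchanged, the clusters can be considered stable
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Ward labels:   ", labels_ward)
print("k-means labels:", km.labels_)
```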

v. One-way ANOVA: analysis of variance, or ANOVA, is a linear modelling method employed to evaluate the relationship among variables. It tests whether mean values vary across the categories of a given input factor. The ANOVA test compares the explained variance (caused by the input factor) with the unexplained variance (caused by the error source): if the ratio of explained to unexplained variance is high, the means are statistically different. In my case, I employed the one-way ANOVA test to determine which classifying variables differ significantly between the clusters identified with Ward's method. The F statistic establishes whether there is a difference between means and whether any of these mean differences are significant. Using this test, one can understand which variables contribute the most to the solution obtained through HCA.
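As a minimal illustration, a one-way ANOVA testing whether a variable (e.g. the frequency of a given activity) differs across three clusters can be run as follows; the three groups are invented for the example.

```python
# Minimal sketch: one-way ANOVA across three hypothetical clusters.
from scipy.stats import f_oneway

cluster_a = [4.1, 3.8, 4.5, 4.0]
cluster_b = [2.0, 2.4, 1.8, 2.2]
cluster_c = [3.1, 2.9, 3.4]

F, p = f_oneway(cluster_a, cluster_b, cluster_c)
print(f"F = {F:.2f}, p = {p:.4f}")  # high F, low p: the group means differ
```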

vi. Tukey post-hoc: also known as Tukey's range test or the Tukey method, this is a statistic used in conjunction with ANOVA to find which means are significantly different from each other. Post-hoc tests are designed for situations in which one has already obtained a significant omnibus F-test and an additional exploration of the differences among means is needed. The Tukey post-hoc test provides specific information on which means are significantly different from each other. In my case, I employed the Tukey post-hoc test to evaluate the contribution of the significant variables to each of the clusters identified.
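Continuing the illustration above, Tukey's HSD can be applied after a significant omnibus F-test to see which pairs of cluster means actually differ; the sketch reuses the same invented groups and assumes scipy >= 1.8 for scipy.stats.tukey_hsd.

```python
# Minimal sketch: Tukey's HSD on the same three hypothetical clusters.
from scipy.stats import tukey_hsd

cluster_a = [4.1, 3.8, 4.5, 4.0]
cluster_b = [2.0, 2.4, 1.8, 2.2]
cluster_c = [3.1, 2.9, 3.4]

res = tukey_hsd(cluster_a, cluster_b, cluster_c)
print(res)  # pairwise mean differences with confidence intervals and p-values
```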