
Exploring Categorical Variables Using the SAS Macro FREQ

From the document Data Mining Using (pages 62-68)

Exploratory Data Analysis

3.4 SAS Macro Applications Used in Data Exploration

3.4.1 Exploring Categorical Variables Using the SAS Macro FREQ

The FREQ macro application is an enhanced version of SAS PROC FREQ with graphical capabilities for exploring categorical data. Since the release of SAS version 8.0, many additional statistical capabilities are available for data exploration in the PROC FREQ macro.8 The advantages of using the FREQ SAS macro over PROC FREQ include:

Vertical, horizontal, block, and pie charts for exploring one-way and two-way frequency tables are automatically produced.

Options for saving the output tables and graphics in WORD, HTML, PDF, and TXT formats are available.

Software requirements for using the FREQ macro include:

SAS/CORE, SAS/BASE, and SAS/GRAPH must be licensed and installed at the site.

The FREQ macro has only been tested in the Windows (Windows 98 and later) environment.

SAS versions 8.0 and above are recommended for full utilization.

An active Internet connection is required for downloading the FREQ macro from the book website if the companion CD-ROM is not available.

3.4.1.1 Steps Involved in Running the FREQ Macro

1. Create a temporary or permanent SAS data file.

2. Open the FREQ.sas macro-call file in the SAS PROGRAM EDITOR window. If the companion CD-ROM is available, the file is in the mac-call folder on the CD-ROM. If the CD-ROM is not available, first verify that the Internet connection is active; instructions for downloading the macro-call and sample data files from the book website are given in the Appendix. Click the RUN icon to submit the macro-call file FREQ.sas, which opens the macro window called FREQ (Figure 3.7).

3456_Book.book Page 51 Wednesday, November 20, 2002 11:34 AM

3. Input the appropriate parameters in the macro-call window by following the instructions provided in the FREQ macro help file in Section 3.4.1.2. After inputting all the required macro parameters, be sure the cursor is in the last input field (#11) and the RESULTS VIEWER window is closed, then hit the ENTER key (not the RUN icon) to submit the macro.

4. Examine the LOG window (available only in DISPLAY mode) for any macro execution errors. If any errors are found in the LOG window, activate the PROGRAM EDITOR window, resubmit the FREQ.sas macro-call file, check the macro input values, and correct any input errors. Otherwise, activate the PROGRAM EDITOR window, resubmit the FREQ.sas macro-call file, and change the macro input (#11) value from DISPLAY to any other desired format (see Section 3.4.1.2). The output, including exploratory graphics and frequency statistics, will be saved in the user-specified format in the user-specified folder as a single file for the file formats WORD, WEB, or PDF. If TXT is selected as the file format in the #11 macro input field, SAS output and graphics files will be saved as separate files.

Figure 3.7 Screen copy of FREQ macro-call window showing the macro-call parameters required for exploring categorical variables.

3.4.1.2 Help File for SAS Macro: FREQ, Description of Macro Parameters

1. Macro-call parameter: Input SAS dataset name (required parameter).

Descriptions and explanation: Include the name of the temporary (member name) or permanent (libname.member_name) SAS dataset on which the data exploration is to be performed.

Options/examples:

Permanent SAS dataset — gf.cars93 (LIBNAME: gf; SAS dataset name: cars93)

Temporary SAS dataset — cars93 (SAS dataset name)

2. Macro-call parameter: Input response group variable name (required parameter).

Descriptions and explanation: Input the name of the categorical variable to be treated as the output (response) variable in a two- or three-way analysis. To create one-way tables and charts, input the categorical variable names here and leave macro input fields #3 and #4 blank.

Option/example:

mpg (name of a target categorical variable)

3. Macro-call parameter: Input GROUP variable name (optional statement).

Descriptions and explanation: Input the name of the first-level categorical variable for a two-way analysis.

Option/example:

c2

4. Macro-call parameter: Input BLOCK variable name (optional statement).

Descriptions and explanation: Input the name of the second level categorical variable for a three-way analysis.

Option/example:

b2
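The way parameters #2 to #4 combine can be pictured outside SAS with a short Python sketch. It is only an illustration of one-, two-, and three-way tabulation, not code from the FREQ macro; the records, the function name tabulate, and its arguments are made up for the example.

```python
from collections import Counter

# Hypothetical records: (response, group, block). Values are invented.
records = [
    (1, "foreign", "small"),
    (0, "foreign", "midsize"),
    (1, "domestic", "small"),
    (0, "domestic", "large"),
    (1, "foreign", "small"),
]

def tabulate(rows, use_group=False, use_block=False):
    """Count rows by response, optionally crossed with group and block."""
    counts = Counter()
    for response, group, block in rows:
        key = (response,)
        if use_group:
            key += (group,)
        if use_block:
            key += (block,)
        counts[key] += 1
    return counts

# Fields #3 and #4 blank: one-way table of the response variable.
one_way = tabulate(records)
# Field #3 filled in: response by group (two-way).
two_way = tabulate(records, use_group=True)
# Fields #3 and #4 filled in: response by group by block (three-way).
three_way = tabulate(records, use_group=True, use_block=True)

for key, count in sorted(three_way.items()):
    print(key, count)
```

Each additional variable only lengthens the key that is counted, which is why leaving fields #3 and #4 blank falls back to a one-way table.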

5. Macro-call parameter: Plot type options (required statement).

Descriptions and explanation: Select the type of frequency/percentage statistics desired in the charts.

Options/explanations:

Percent: report percentages

Freq: report frequencies

Cpercent: report cumulative percentages

Cfreq: report cumulative frequencies
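The four statistics differ only in whether counts are expressed as a share of the total and whether they are accumulated across categories. The Python sketch below makes this concrete using the origin counts reported in Table 3.1 of the case study; the function name chart_statistics is hypothetical, and the code is an illustration, not part of the macro.

```python
def chart_statistics(counts):
    """Return freq, percent, cfreq, and cpercent per category, in input order."""
    total = sum(counts.values())
    rows = []
    running = 0
    for category, freq in counts.items():
        running += freq  # cumulative frequency up to and including this category
        rows.append({
            "category": category,
            "freq": freq,
            "percent": 100 * freq / total,
            "cfreq": running,
            "cpercent": 100 * running / total,
        })
    return rows

# Origin counts as reported in Table 3.1.
origin_counts = {"Foreign (0)": 45, "Domestic (1)": 48}

for row in chart_statistics(origin_counts):
    print(f"{row['category']:<14}{row['freq']:>5}{row['percent']:>8.2f}"
          f"{row['cfreq']:>5}{row['cpercent']:>8.2f}")
```

The cumulative columns depend on the order of the categories, so the same data can give different Cfreq and Cpercent charts if the categories are sorted differently.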


6. Macro-call parameter: Type of patterns used in bars (required statement).

Descriptions and explanation: Select the pattern specifications in different bar charts.

Options/explanations:

Midpoint: Changes patterns when the midpoint value changes. If the GROUP= option is specified, the respective midpoint patterns are repeated for each group.

Group: Changes patterns when the group variable changes. All bars within each group use the same pattern, but a different pattern is used for each group.

Subgroup: Changes patterns when the value of the subgroup variable changes. The SUBGROUP= option must have been specified. Without SUBGROUP=, all bars will have the same pattern.

7. Macro-call parameter: Color options (required statement).

Descriptions and explanation: Input whether color or black-and-white charts are required.

Options/explanations:

Color: preassigned colors used in charts

Gray: preassigned gray shades used in charts

8. Macro-call parameter: zth number of run (required statement).

Descriptions and explanation: SAS output files will be saved by forming a file name from the original SAS dataset name and the counter number provided in macro input field #8. For example, if the original SAS dataset name is “gf.cars93” and the counter number is 1, the SAS output files will be saved as “gf.cars931.*” in the user-specified folder. By changing the counter number, users can avoid overwriting previous SAS output files with new output.

Options/explanations: Any numbers or letters are valid.

9. Macro-call parameter: Folder to save SAS output (optional statement).

Descriptions and explanation: To save the SAS output files in a specific folder, input the full path of the folder. The SAS dataset name will be assigned to the output file. If this field is left blank, the output file will be saved in the default folder.

Options/explanations:

Possible values

c:\output\ (folder named “OUTPUT”)

s:\george\ (folder named “George” on mapped network drive S)

Be sure to include the backslash at the end of the folder name.
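Parameters #8 and #9 together determine where an output file is saved. The Python sketch below assumes that the folder, dataset name, and counter are joined by plain concatenation; this is an assumption made for illustration (the macro source is not shown here), and the function name output_path is hypothetical. Under that assumption, it shows why the trailing backslash matters and how a new counter value produces a new file name.

```python
def output_path(folder, dataset, counter):
    """Build an output file stem as folder + dataset name + run counter.

    Assumption: the pieces are joined by plain concatenation, so the
    folder string must already end in a backslash.
    """
    return f"{folder}{dataset}{counter}"

# With the trailing backslash, the file lands inside the folder.
print(output_path("c:\\output\\", "gf.cars93", 1))

# Without it, the folder name runs into the file name.
print(output_path("c:\\output", "gf.cars93", 1))

# A different counter gives a different name, so run 1 is not overwritten.
print(output_path("c:\\output\\", "gf.cars93", 2))
```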

10. Macro-call parameter: Folder to save SAS graphics (optional statement)


Descriptions and explanation: To save the SAS graphics files in the EMF format suitable for inclusion in PowerPoint presentations, specify the output format as TXT in version 8.0 or later. In pre-8.0 versions, all graphic format files will be saved in a user-specified folder. If the graphics folder field is left blank, the graphics file will be saved in the default folder.

Options/explanations:

Possible values

c:\output\ — folder named OUTPUT

11. Macro-call parameter: Display or save SAS output (required statement).

Descriptions and explanation: Option for displaying all output files in the OUTPUT window or saving files as a specific format in a folder specified in option #9.

Options/explanations:

Possible values

DISPLAY: Output will be displayed in the OUTPUT window. All SAS graphics will be displayed in the GRAPHICS window. System messages will be displayed in the LOG window.

WORD: Output and all SAS graphics will be saved together in the user-specified folder and will be displayed in the VIEWER window as a single RTF format file (version 8.0 and later). In pre-8.0 versions, SAS output will be saved as a text file, and all graphics files will be saved separately in the CGM format in a user-specified folder (macro input option #10).

WEB: Output and graphics are saved in the user-specified folder and are viewed in the results VIEWER window as a single HTML file (version 8.0 and later). In pre-8.0 versions, SAS output will be saved as a text file, and all graphics files will be saved separately in GIF format in a user-specified folder (macro input option #10).

PDF: Output and graphics are saved in the user-specified folder and are viewed in the results VIEWER window as a single PDF (version 8.2 and later) file. In pre-8.2 versions, SAS output will be saved as a text file, and all graphics files will be saved separately in the PNG format in a user-specified folder (macro input option #10).

TXT: Output will be saved as a TXT file in all SAS versions. No output will be displayed in the OUTPUT window. All graphic files will be saved in the EMF format in version 8.0 and later or CGM format in pre-8.0 versions in a user-specified folder (macro input option #10).

Note: System messages are deleted from the LOG window if DISPLAY is not selected as the input.
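The version-dependent behavior of input #11 is easier to compare when laid out as a lookup. The Python sketch below only restates the rules listed above; it is not code from the FREQ macro. The function name saved_formats, the (major, minor) version tuple, and the label "embedded" (graphics stored inside the single report file) are conventions invented for the example.

```python
def saved_formats(option, version):
    """Return (report format, graphics format) for macro input #11.

    `version` is the SAS release as a (major, minor) tuple, e.g. (8, 2).
    "embedded" means the graphics are saved inside the single report file.
    """
    option = option.upper()
    if option == "DISPLAY":
        return ("OUTPUT window", "GRAPHICS window")
    if option == "WORD":
        return ("RTF", "embedded") if version >= (8, 0) else ("TXT", "CGM")
    if option == "WEB":
        return ("HTML", "embedded") if version >= (8, 0) else ("TXT", "GIF")
    if option == "PDF":
        # PDF output needs version 8.2 or later, not just 8.0.
        return ("PDF", "embedded") if version >= (8, 2) else ("TXT", "PNG")
    if option == "TXT":
        return ("TXT", "EMF" if version >= (8, 0) else "CGM")
    raise ValueError(f"unknown option: {option}")

for opt in ("DISPLAY", "WORD", "WEB", "PDF", "TXT"):
    print(opt, saved_formats(opt, (8, 2)), saved_formats(opt, (6, 12)))
```

The one asymmetry is the PDF option, whose threshold is version 8.2 instead of 8.0.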


3.4.1.3 Case Study 1: Exploring Categorical Variables in a Permanent SAS Dataset gf.cars93

Open the FREQ macro-call window in SAS (Figure 3.7) and input the appropriate macro input values following the suggestions given in the help file (Section 3.4.1.2). Input MPG (miles per gallon) as the target categorical variable in macro input option #2. Input b2 (origin) as the group variable in macro input option #3. To account for the differences in car types, input c3 (car type) as the block variable in macro input option #4. After inputting the other graphical and file-saving parameters, submit the FREQ macro-call window, and SAS will output frequency statistics and exploratory charts for the MPG categorical variable by car origin and car type. Only selected output and graphics generated by the FREQ macro are described below.

The one-way frequency and percentage statistics for car origin and car type are presented in Tables 3.1 and 3.2. Two-way percentage statistics for car type and MPG are illustrated in a donut chart in Figure 3.6. Table 3.3 is a two-way frequency table for car type × MPG group for foreign-made cars. The variation in frequency distribution by car type × car origin × MPG group is illustrated as a stacked block chart in Figure 3.4 and as a stacked vertical bar chart in Figure 3.5. No large car is found among the 44 foreign-made cars. Regardless of origin, a majority of the compact and small cars are more fuel efficient than the mid-size, sporty, large, and van-type vehicles.

Source file: gf.cars93

Categorical variables:

MPG (fuel efficiency: 0, below 26 mpg; 1, over 26 mpg)

b2 (origin of cars: 0, foreign; 1, American)

c3 (type of vehicle: compact, large, midsize, small, sporty, van)

Number of observations: 93

Data source: Lock11

Table 3.1 Macro FREQ: PROC FREQ Output, Frequency, and Percentage Values for Origin

Origin (b2) Frequency Percent

Foreign (0) 45 48.39

Domestic (1) 48 51.61


For the proportion of foreign-made cars, the 95% confidence intervals and exact confidence intervals are given in Table 3.4. The null hypothesis that the proportion of foreign-made cars in the database equals 0.5 could not be rejected at the 5% level (P value 0.7557 in Table 3.5). The null hypothesis that car type and fuel efficiency (MPG) are independent is rejected at the 5% level based on the chi-square test (P value <0.0001 in Table 3.6).
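The proportion test can be checked by hand from the counts in Table 3.1 (45 foreign-made cars out of 93). The Python sketch below assumes that the reported P value is the two-sided one from the asymptotic (normal-approximation) test, with the standard error computed under the null hypothesis; the function name proportion_z_test is hypothetical. Under that assumption the result comes out close to the 0.7557 quoted above.

```python
import math

def proportion_z_test(successes, n, p0=0.5):
    """Asymptotic test of H0: proportion == p0, two-sided.

    Returns the sample proportion, the z statistic, and the P value.
    """
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    z = (p_hat - p0) / se0
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_hat, z, p_value

# Counts from Table 3.1: 45 foreign-made cars among 93.
p_hat, z, p_value = proportion_z_test(45, 93)
print(f"proportion = {p_hat:.4f}, z = {z:.4f}, P value = {p_value:.4f}")
```

Because the P value is far above 0.05, the data give no reason to doubt an even split between foreign-made and domestic cars.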
