Handle missing values

Which of the following is typically used to indicate missing data in a table?

This module explores missing data in SAS, focusing on numeric missing data. It describes how to indicate missing data in your raw data files, how missing data are handled in SAS procedures, and how to handle missing data in a SAS data step. Suppose we ran a reaction time study with six subjects, and each subject's reaction time was measured three times. The data file is shown below.

DATA times ;
  INPUT id trial1 trial2 trial3 ;
  CARDS ;
1 1.5 1.4 1.6
2 1.5 .   1.9
3 .   2.0 1.6
4 .   .   2.2
5 2.1 2.3 2.2
6 1.8 2.0 1.9
;
RUN ;

PROC PRINT DATA=times ;
RUN ;

You might notice that some of the reaction times are coded as a single dot. For example, for subject 2, the second trial is coded as just a dot. The person measuring response time for that trial did not measure it properly, so the data for that trial are missing.

OBS    ID    TRIAL1    TRIAL2    TRIAL3

 1      1      1.5       1.4       1.6
 2      2      1.5        .        1.9
 3      3       .        2.0       1.6
 4      4       .         .        2.2
 5      5      2.1       2.3       2.2
 6      6      1.8       2.0       1.9

In your raw data, a missing value is generally coded as a single period (.). SAS recognizes a single . as a missing value and handles it in special ways. Let’s examine how SAS handles missing data in procedures.

2. How SAS handles missing data in SAS procedures

As a general rule, SAS procedures that perform computations handle missing data by omitting the missing values. (We say "procedures that perform computations" because we are not addressing procedures like proc contents.) The way that missing values are eliminated is not always the same among SAS procedures, so let’s look at some examples. First, let’s run proc means on our data file and see how it handles the missing values.

PROC MEANS DATA=times ;
  VAR trial1 trial2 trial3 ;
RUN ;

As you see in the output below, proc means computed the means using 4 observations for trial1 and trial2 and 6 observations for trial3. In short, proc means used all of the valid data and performed the computations on all of the available data.

Variable    N         Mean      Std Dev      Minimum      Maximum
-------------------------------------------------------------------
TRIAL1      4    1.7250000    0.2872281    1.5000000    2.1000000
TRIAL2      4    1.9250000    0.3774917    1.4000000    2.3000000
TRIAL3      6    1.9000000    0.2683282    1.6000000    2.2000000
-------------------------------------------------------------------
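If you also want the count of missing values reported next to the count of valid values, one option is to request the NMISS statistic explicitly. The sketch below assumes the times data set created earlier is still available. The choice of statistics (N, NMISS, MEAN) is ours and not part of the original example.

```sas
/* Ask proc means for the number of non-missing values (N),
   the number of missing values (NMISS) and the mean.       */
PROC MEANS DATA=times N NMISS MEAN ;
  VAR trial1 trial2 trial3 ;
RUN ;
```

For these data, NMISS should be 2 for trial1 and trial2 and 0 for trial3, matching the N values of 4, 4 and 6 shown above.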

As you see below, proc freq likewise performed its computations using just the available data. Note that the percentages are computed based on just the total number of non-missing cases.

PROC FREQ DATA=times ;
  TABLES trial1 trial2 trial3 ;
RUN ;

                                  Cumulative  Cumulative
TRIAL1    Frequency    Percent    Frequency    Percent
----------------------------------------------------
   1.5        2          50.0          2         50.0
   1.8        1          25.0          3         75.0
   2.1        1          25.0          4        100.0

               Frequency Missing = 2

                                  Cumulative  Cumulative
TRIAL2    Frequency    Percent    Frequency    Percent
----------------------------------------------------
   1.4        1          25.0          1         25.0
     2        2          50.0          3         75.0
   2.3        1          25.0          4        100.0

               Frequency Missing = 2

                                  Cumulative  Cumulative
TRIAL3    Frequency    Percent    Frequency    Percent
----------------------------------------------------
   1.6        2          33.3          2         33.3
   1.9        2          33.3          4         66.7
   2.2        2          33.3          6        100.0

It is possible that you might want the percentages to be computed out of the total number of values, and even report the percentage missing right in the table itself. You can request this using the missing option on the tables statement of proc freq as shown below (just for trial1).

PROC FREQ DATA=times ;
  TABLES trial1 / MISSING ;
RUN ;

As you see, now the percentages are computed out of the total number of observations, and the percentage missing is shown right in the table as well.

                                  Cumulative  Cumulative
TRIAL1    Frequency    Percent    Frequency    Percent
----------------------------------------------------
     .        2          33.3          2         33.3
   1.5        2          33.3          4         66.7
   1.8        1          16.7          5         83.3
   2.1        1          16.7          6        100.0
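A related option, not covered in the text above, is MISSPRINT. It displays the frequency of missing values in the table but keeps them out of the percentage calculations. A minimal sketch, again assuming the times data set from earlier:

```sas
/* MISSPRINT lists the missing category in the table, while the
   percentages are still based on the non-missing values only.  */
PROC FREQ DATA=times ;
  TABLES trial1 / MISSPRINT ;
RUN ;
```

This can be a useful middle ground when you want to see how many values are missing without changing the percentages reported for the valid values.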

Let’s look at how proc corr handles missing data. We would expect that it would do the computations based on the available data, and omit the missing values. Here is an example program.

PROC CORR DATA=times ;
  VAR trial1 trial2 trial3 ;
RUN ;

The output of this program is shown below. Note how the missing values were excluded. For each pair of variables, proc corr used the number of pairs that had valid data. For the pair formed by trial1 and trial2, there were 3 pairs with valid data. For the pairing of trial1 and trial3 there were 4 valid pairs, and likewise there were 4 valid pairs for trial2 and trial3. Since this used all of the valid pairs of data, this is often called pairwise deletion of missing data.

Correlation Analysis

3 'VAR' Variables:  TRIAL1  TRIAL2  TRIAL3

Simple Statistics

Variable    N        Mean     Std Dev          Sum     Minimum     Maximum
TRIAL1      4    1.725000    0.287228     6.900000    1.500000    2.100000
TRIAL2      4    1.925000    0.377492     7.700000    1.400000    2.300000
TRIAL3      6    1.900000    0.268328    11.400000    1.600000    2.200000

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / Number of Observations

            TRIAL1     TRIAL2     TRIAL3

TRIAL1     1.00000    0.98198    0.85280
           0.0        0.1210     0.1472
           4          3          4

TRIAL2     0.98198    1.00000    0.76089
           0.1210     0.0        0.2391
           3          4          4

TRIAL3     0.85280    0.76089    1.00000
           0.1472     0.2391     0.0
           4          4          6

It is possible to ask SAS to only perform the correlations on the observations that had complete data for all of the variables on the var statement. For example, you might want the correlations of the reaction times just for the observations that had non-missing data on all of the trials. This is called listwise deletion of missing data meaning that when any of the variables are missing, the entire observation is omitted from the analysis. You can request listwise deletion within proc corr with the nomiss option as illustrated below.

PROC CORR DATA=times NOMISS ;
  VAR trial1 trial2 trial3 ;
RUN ;

With the nomiss option, the N for all the simple statistics is the same, 3, which corresponds to the number of cases with complete non-missing data for trial1, trial2 and trial3 (observations 1, 5 and 6). Since the N is the same for all of the correlations (i.e., 3), the N is not displayed along with the correlations.
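Before requesting listwise deletion, it can help to check how many observations would survive it. The sketch below is one way to do that, assuming the times data set from earlier. The data set name complete_check and the variables n_missing and complete are hypothetical names introduced for illustration.

```sas
/* Flag observations that have no missing values on any trial. */
DATA complete_check ;
  SET times ;
  n_missing = NMISS(OF trial1-trial3) ;  /* number of missing trials */
  complete  = (n_missing = 0) ;          /* 1 = complete case, 0 = not */
RUN ;

/* Count complete versus incomplete cases. */
PROC FREQ DATA=complete_check ;
  TABLES complete ;
RUN ;
```

For these data, three observations have complete = 1, which agrees with the N of 3 that proc corr uses under nomiss.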

3. Summary of how missing values are handled in SAS procedures

If your data contain missing values, it is important to understand how the SAS procedures you use handle them. To find out how a particular procedure handles missing data, consult the SAS manual. Here is a brief overview of how some common SAS procedures handle missing data.

– proc means
For each variable, the statistics are computed from the non-missing values of that variable.
– proc freq
By default, missing values are excluded and percentages are based on the number of non-missing values. If you use the missing option on the tables statement, the percentages are based on the total number of observations (non-missing and missing) and the percentage of missing values is reported in the table.
– proc corr
By default, correlations are computed based on the number of pairs with non-missing data (pairwise deletion of missing data). The nomiss option can be used on the proc corr statement to request that correlations be computed only for observations that have non-missing data for all variables on the var statement (listwise deletion of missing data).
– proc reg
If any of the variables on the model or var statement is missing for an observation, that observation is excluded from the analysis (i.e., listwise deletion of missing data).
– proc factor
Missing values are deleted listwise, i.e., observations with missing values on any of the variables in the analysis are omitted from the analysis.
– proc glm
The handling of missing values in proc glm can be complex to explain. If you have an analysis with just one variable on the left side of the model statement (just one outcome or dependent variable), observations are eliminated if any of the variables on the model statement are missing. Likewise, if you are performing a repeated measures ANOVA or a MANOVA, then observations are eliminated if any of the variables in the model statement are missing. For other situations, see the SAS/STAT manual about proc glm.

For other procedures, see the SAS manual for information on how missing data are handled.

4. Missing values in assignment statements

It is important to understand how missing values are handled in assignment statements. Consider a data step that creates a new variable avg from trial1, trial2 and trial3 with an ordinary arithmetic expression, such as (trial1 + trial2 + trial3) / 3.
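A minimal sketch of such an assignment is given here, assuming the times data set from earlier. The output data set name times2 and the exact form of the expression are assumptions made for illustration.

```sas
/* Compute the average with plain arithmetic.
   If any trial is missing, the whole expression is missing. */
DATA times2 ;
  SET times ;
  avg = (trial1 + trial2 + trial3) / 3 ;
RUN ;

PROC PRINT DATA=times2 ;
RUN ;
```

Because every trial enters the sum directly, a single missing trial is enough to make avg missing for that observation.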

Printing the resulting data set shows how missing values are handled in assignment statements. The variable avg is based on the variables trial1, trial2 and trial3. If any of those variables is missing, the value for avg is set to missing. As a result, avg is missing for observations 2, 3 and 4.

In fact, SAS writes a NOTE: to the Log to let you know about the missing values that were created. In this example, the note reports that three missing values were created in the program at line 224. This makes sense: we know that 3 missing values were created for avg, and avg is created on line 224.

As a general rule, computations involving missing values yield missing values. For example,

2 + 2 yields 4
2 + . yields .
2 / 2 yields 1
. / 2 yields .
2 * 3 yields 6
2 * . yields .

Whenever you add, subtract, multiply, divide, etc., values that involve missing data, the result is missing.
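You can confirm this rule directly in the Log with a small data step that creates no data set. The variable names a through d below are arbitrary names chosen for this sketch.

```sas
/* Arithmetic on missing values: results are written to the Log. */
DATA _NULL_ ;
  a = 2 + 2 ;   /* 4 */
  b = 2 + . ;   /* missing */
  c = . / 2 ;   /* missing */
  d = 2 * . ;   /* missing */
  PUT a= b= c= d= ;
RUN ;
```

The PUT statement writes a=4 b=. c=. d=. to the Log, and SAS also adds a note that missing values were generated as a result of performing an operation on missing values.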

In our reaction time experiment, the average reaction time avg is missing for three out of six cases. We could instead average just the non-missing trials by using the mean function, for example avg = mean(trial1, trial2, trial3).
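A minimal sketch using the mean function follows, assuming the times data set from earlier. The output data set name times2 is a hypothetical name.

```sas
/* MEAN ignores missing arguments and averages the values that remain. */
DATA times2 ;
  SET times ;
  avg = MEAN(trial1, trial2, trial3) ;
RUN ;

PROC PRINT DATA=times2 ;
RUN ;
```

For observation 2, for instance, avg is the average of 1.5 and 1.9 only, because trial2 is missing. MEAN returns a missing value only when all of its arguments are missing.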

With the mean function, avg contains the average of the non-missing trials.

Had there been a large number of trials, say 50 trials, then it would be annoying to have to type
avg = mean(trial1, trial2, trial3 …. trial50)
Here is a shortcut you could use in this kind of situation:
avg = mean(of trial1-trial50)

Also, if we wanted to get the sum of the times instead of the average, then we could just use the sum function instead of the mean function. The syntax of the sum function is just like the mean function, but it returns the sum of the non-missing values.
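The sum function and the OF shortcut can be combined as sketched below, assuming the times data set from earlier. The names times2 and total are hypothetical.

```sas
/* SUM adds the non-missing trials; OF expands the numbered range. */
DATA times2 ;
  SET times ;
  total = SUM(OF trial1-trial3) ;
RUN ;

PROC PRINT DATA=times2 ;
RUN ;
```

Note that the OF keyword matters. Without it, SAS reads trial1-trial3 as a subtraction, so SUM(trial1-trial3) would return the single difference trial1 minus trial3 (or missing if either is missing) rather than the sum over the range.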

Finally, you can use the N function to determine the number of non-missing values in a list of variables, for example n = n(trial1, trial2, trial3).
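A minimal sketch of counting valid values with the N function follows, assuming the times data set from earlier and a hypothetical output data set times2.

```sas
/* N counts the non-missing arguments for each observation. */
DATA times2 ;
  SET times ;
  n   = N(trial1, trial2, trial3) ;
  avg = MEAN(trial1, trial2, trial3) ;
RUN ;

PROC PRINT DATA=times2 ;
RUN ;
```

Keeping n next to avg makes it easy to see how many trials each average is based on.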

In our data, observations 1, 5 and 6 had three valid values, observations 2 and 3 had two valid values, and observation 4 had only one valid value.

You might feel uncomfortable with the variable avg for observation 4, since it is not really an average at all. We can use the variable n to compute avg only when there are two or more valid values, and set avg to missing when the number of non-missing values is 1 or less.
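One way to express this rule is sketched below, assuming the times data set from earlier. The output data set name times2 is hypothetical, and the threshold of two valid values comes from the description above.

```sas
/* Compute avg only when at least two trials are non-missing. */
DATA times2 ;
  SET times ;
  n = N(OF trial1-trial3) ;
  IF n >= 2 THEN avg = MEAN(OF trial1-trial3) ;
  ELSE avg = . ;
RUN ;

PROC PRINT DATA=times2 ;
RUN ;
```

The explicit ELSE branch is not strictly required, since avg would be missing anyway if it were never assigned, but it documents the intent.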

With this change, avg contains the average reaction time for the non-missing values, except for observation 4, where it is set to missing because that observation had only 1 valid value.

5. Missing values in logical statements

It is important to understand how missing values are handled in logical statements. For example, say that you want to create a 0/1 variable trial1a that is 0 if trial1 is 1.5 or less, and 1 if it is over 1.5. A first attempt might simply test whether trial1 is less than or equal to 1.5 (incorrectly, as you will see).
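A minimal sketch of the naive recoding described above follows, assuming the times data set from earlier. The output data set name times2 and the exact IF/ELSE form are assumptions made for illustration.

```sas
/* Naive recode: a missing trial1 satisfies trial1 <= 1.5,
   so it is wrongly coded as 0.                            */
DATA times2 ;
  SET times ;
  IF trial1 <= 1.5 THEN trial1a = 0 ;
  ELSE trial1a = 1 ;
RUN ;

PROC PRINT DATA=times2 ;
  VAR id trial1 trial1a ;
RUN ;
```

In the printed result, the observations with id 3 and 4 have trial1 missing but trial1a equal to 0.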

With that approach, the values for trial1a are wrong when id is 3 or 4, where trial1 is missing. This is because SAS treats a numeric missing value as smaller than any number (in effect, negative infinity), so it compares as less than 1.5 and trial1a becomes 0.

Instead, we explicitly exclude missing values in the condition to make sure they are treated properly.
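One way to exclude missing values explicitly is sketched below, assuming the times data set from earlier. The use of the MISSING function and the data set name times2 are our choices. A comparison such as trial1 > . in the condition is another common way to write the same check.

```sas
/* Handle the missing case first, then recode the valid values. */
DATA times2 ;
  SET times ;
  IF MISSING(trial1) THEN trial1a = . ;
  ELSE IF trial1 <= 1.5 THEN trial1a = 0 ;
  ELSE trial1a = 1 ;
RUN ;

PROC PRINT DATA=times2 ;
  VAR id trial1 trial1a ;
RUN ;
```

With this version, trial1a stays missing for the observations with id 3 and 4 instead of being coded as 0.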

And now we get the results that we wish. The value for trial1a is 0 only when trial1 is less than or equal to 1.5 and not missing, and 1 only when trial1 is over 1.5.

