Spatial Intro 07: Cleaning Data -- Missing and Bad Data Values

Authors:
Reviewers:

Overview

About
Things You’ll Need To Complete This Lesson
NoData Values (NA, NaN)
Bad Data Values

About

Add description.

R Skill Level: Intermediate - you’ve got the basics of R down.

Goals / Objectives

After completing this activity, you will:

Things You’ll Need To Complete This Lesson

To complete this lesson you will need the most current version of R and, preferably, RStudio installed on your computer.

Install R Packages

NAME: install.packages("NAME")
More on Packages in R - Adapted from Software Carpentry.

Download Data

EDIT AS NEEDED

Download NEON Teaching Data Subset: Airborne Remote Sensing Data

The LiDAR and imagery data used to create this raster teaching data subset were collected over the National Ecological Observatory Network’s Harvard Forest and San Joaquin Experimental Range field sites and processed at NEON headquarters. The entire dataset can be accessed by request from the NEON Airborne Data Request Page on the NEON website.

Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets. An overview of setting the working directory in R can be found here.

R Script & Challenge Code: NEON data lessons often contain challenges that reinforce learned skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.

Spatial-Temporal Data & Data Management Lesson Series: This lesson is part of a lesson series introducing spatial data and data management in R. It is also part of a larger spatio-temporal Data Carpentry Workshop that includes working with raster data in R, vector data in R, and tabular time series in R.

Additional Resources

Clean Data

No dataset is perfect. It is common to encounter large files containing obviously erroneous data (bad data). It is also common to encounter NoData values that we need to account for when analyzing our data.

NoData Values (NA, NaN)

If we are lucky when working with external data, the NoData value is clearly specified in the metadata. Sometimes this value is NA or NaN (not a number). However, NA isn’t always used. Text values can make data storage difficult for some programs, so you will sometimes encounter a large negative value such as -9999 used as the NoData value. At other times, blank values in a data file designate NoData. Blanks are particularly problematic because we can’t be certain whether a data value is purposefully missing (not measured that day, or a bad measurement) or whether someone unintentionally deleted it.

Data Tip: -9999 is a common NoData value in both remote sensing and the atmospheric sciences. It is also the standard used by the National Ecological Observatory Network (NEON).

Because the actual value used to designate missing data can vary depending upon what data we are working with, it is important to always check the metadata for the file’s associated NoData value. If the value is NA, we are in luck: R will recognize and flag this value as NoData. If the value is numeric (e.g., -9999), then we need to assign this value to NA ourselves.

Data Tip: R recognizes NA as missing data. Many summary functions, such as mean() and sum(), return NA when their input contains NA unless you set na.rm = TRUE. A NoData value of -9999, however, is treated as a valid number and included in calculations. If you encounter a numeric NoData value, be sure to assign it to NA in R: objectName[objectName==-9999] <- NA
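As a minimal sketch of the assignment in the tip above, the following uses a small made-up vector. The name airt_raw and its values are hypothetical and are not taken from the lesson data.

```r
# Hypothetical air temperature readings; -9999 marks missing data
airt_raw <- c(12.5, -9999, 14.1, 13.8, -9999)

# Left as-is, the sentinel is treated as a real number and skews the mean
mean_with_sentinel <- mean(airt_raw)

# Reassign the numeric NoData value to NA
airt <- airt_raw
airt[airt == -9999] <- NA

# mean() returns NA when NAs are present unless na.rm = TRUE is set
mean_default <- mean(airt)
mean_clean   <- mean(airt, na.rm = TRUE)
```

Comparing mean_with_sentinel with mean_clean shows how strongly an unconverted NoData value can distort even a simple summary.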

Check for NA values

We can quickly check for NoData values in our data using the is.na() function. By asking for the sum() of is.na(), we can see how many NA (missing) values we have.

REPLACE CODE TO BE FOR THE SAME SMALLISH DATA SET USED FOR BAD DATA VALUES BELOW

# Check for NA values
# (assumes the data frame harMet15.09.11 has already been loaded)
sum(is.na(harMet15.09.11$datetime))

sum(is.na(harMet15.09.11$airt))

# view rows where the air temperature is NA
harMet15.09.11[is.na(harMet15.09.11$airt),]

The results above tell us there are no NoData values in the datetime column. However, there are NoData values in other variables.
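The same checks can be tried on a small, self-contained table. This is only a sketch: the data frame met and its values are invented for illustration and stand in for whatever dataset you have loaded.

```r
# Hypothetical weather table with a few missing air temperatures
met <- data.frame(
  datetime = as.POSIXct(c("2015-09-11 00:00", "2015-09-11 00:15",
                          "2015-09-11 00:30", "2015-09-11 00:45"),
                        tz = "UTC"),
  airt = c(15.2, NA, 14.9, NA)
)

# is.na() returns TRUE/FALSE; sum() counts the TRUE values
n_missing_datetime <- sum(is.na(met$datetime))
n_missing_airt     <- sum(is.na(met$airt))

# Count missing values in every column at once
na_per_column <- colSums(is.na(met))

# Rows where air temperature is missing
missing_rows <- met[is.na(met$airt), ]
```

Using colSums(is.na(...)) is a convenient way to scan a whole table before deciding which variables need closer attention.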

Deal with NoData Values

When we encounter NoData values (blank, NaN, -9999, etc.) in our data, we need to decide how to deal with them. By default, R treats NoData values designated with an NA as missing values rather than zeros. This is good, as a value of zero (no rain today) is not the same as missing data (e.g., we didn’t measure the amount of rainfall today).
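To illustrate why zero and missing are not interchangeable, here is a small sketch with invented rainfall values (the vector rain is hypothetical):

```r
# Hypothetical daily rainfall (mm); NA means no measurement was taken
rain <- c(0, 5, NA, 10)

# Mean over the three days that were actually measured: (0 + 5 + 10) / 3
mean_measured <- mean(rain, na.rm = TRUE)

# Wrongly treating the missing day as "no rain" changes the answer:
# (0 + 5 + 0 + 10) / 4
rain_as_zero <- rain
rain_as_zero[is.na(rain_as_zero)] <- 0
mean_as_zero <- mean(rain_as_zero)
```

The two means differ, which is why a missing measurement should stay NA rather than be recorded as zero.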

How we deal with NoData values will depend on:

the data type we are working with
the analysis we are conducting
the significance of the gap or missing value

Sometimes we might need to “gap fill” our data. This means we will interpolate or estimate missing values, often using statistical methods. Gap filling can be complex and is beyond the scope of this lesson. The takeaway from this lesson is simply that it is important to acknowledge missing values in your data and to carefully consider how you wish to account for them during analysis.

Other resources:

Quick-R: Missing Data – R code for dealing with missing data
The Institute for Digital Research and Education has an R FAQ on Missing Values.

Bad Data Values

Bad data values are different from NoData values. Bad data values are values that fall outside of the valid range of a dataset. Examples of bad data values:

The normalized difference vegetation index (NDVI), which is a measure of greenness, has a valid range of -1 to 1. Any value outside of that range would be considered a “bad” or miscalculated value.
If we are using a Julian day (0-365/366) to represent the day of the year, a value of 1110 is clearly not correct.
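A range check like the NDVI example above can be sketched in a few lines of R. The vector ndvi and its values are made up for illustration.

```r
# Hypothetical NDVI values; valid NDVI lies between -1 and 1
ndvi <- c(0.62, 0.71, -0.05, 1.8, 0.44, -3.2)

# Logical test for values outside the valid range
is_bad <- ndvi < -1 | ndvi > 1

# How many suspect values are there, and where are they?
n_bad      <- sum(is_bad)
bad_idx    <- which(is_bad)
bad_values <- ndvi[is_bad]
```

Keeping the logical vector is_bad separate from the data lets you inspect the suspect values before deciding what to do with them.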

Find Bad Data Values

Sometimes a dataset’s metadata will tell us the range of expected values for a variable, or common sense dictates the expected range (as in the Julian day example above). Values outside of this range are suspect, and we need to consider that when we analyze the data. Sometimes we need to use common sense and scientific insight as we examine the data, just as we would for field data, to identify questionable values.

We can explore the distribution of values contained within our data using the hist() function, which produces a histogram. Histograms are often useful for identifying outliers and bad data values in our raster data.
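As a rough sketch of this approach, the following plots a histogram of a small invented vector containing one outlandish value. The data are hypothetical, and the choice of breaks = 20 is an arbitrary assumption.

```r
# Hypothetical NDVI values with one outlandish entry (9.9)
ndvi <- c(0.62, 0.71, 0.58, 0.44, 0.67, 0.73, 0.51, 9.9)

# Plot the distribution; the isolated bar far to the right is the suspect value
h <- hist(ndvi,
          breaks = 20,
          main = "Distribution of NDVI values",
          xlab = "NDVI")

# hist() also returns the bin edges and counts, which we can inspect directly
h$breaks
h$counts
```

An isolated bar far from the bulk of the data is a cue to go back to the metadata and check the expected range.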

HIST of SMALLISH DATA SET WITH SOME OUTLANDISH VALUE

Deal with Bad Data Values

Get Lesson Code

(some browsers may require you to right click.)

/R/dc-spatio-temporal-intro/08-clean-data-bad-and-missing-values.R

Workshop:
spatial-data-management-series

Spatial Intro 00: Answer a Spatio-temporal Research Question with Data - Where to Start?

Spatial Intro 01: The Spatial Data Landscape

Spatial Intro 02: Spatial Data Formats -- Intro to Shapefiles (Vector Data) and GeoTIFFs (Raster Data)

Spatial Intro 03: Data About Data -- Intro to Metadata File Formats and Structure

Spatial Intro 04: Intro to Coordinate Reference Systems & Spatial Projections

Spatial Intro 05: Introduction to the UTM Projected CRS and a Brief overview of datums

Spatial Intro 06: A Quick Overview of CRS Formats - Proj4 & EPSG - in R

Spatial Intro 07: Cleaning Data -- Missing and Bad Data Values

Spatial Intro 08: Intro to Spatial Metadata -- Spatial Resolution and Spatial Extent

Spatial Intro 09: R Packages for Spatial-temporal Data -- Setting Up Your Working Environment
