This lesson is being piloted (Beta version)

Introduction

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is messy data?

  • What is OpenRefine?

  • Why use OpenRefine as part of your workflow?

Objectives
  • Describe OpenRefine’s uses and applications.

  • Differentiate data cleaning from data organization.

  • Experiment with OpenRefine’s user interface.

  • Locate helpful resources to learn more about OpenRefine.

Lesson

“Messy data”

Data needs to be consistent in several ways before you can work with it. The most obvious way to create consistency is to structure your data so that each column holds one variable of a single data type and each row holds one observation (see the lesson on Spreadsheets). But even in a carefully structured spreadsheet, errors can creep in and cause problems during analysis.

Today we are going to talk about some of the common things that make data “messy”. These can include:

Most data is at least a little messy. You will probably spend a lot of time cleaning data, and it is an iterative process. Everyone who works with data has to deal with this; you are not alone!
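To make the idea concrete, here is a minimal Python sketch (not part of OpenRefine) of how small inconsistencies split one category into several. The city values are invented for illustration; the point is only that a computer treats each spelling as a different value.

```python
from collections import Counter

# Hypothetical survey column: "Lagos" was typed in several different ways.
cities = ["Lagos", "lagos", "Lagos ", "Lgaos", "Lagos", "Accra"]

# Count each raw value exactly as it was typed.
raw_counts = Counter(cities)

# Every variant shows up as its own category, so totals per city are split.
for value, count in raw_counts.most_common():
    print(repr(value), count)
```

Differences in case, a trailing space, and a typo all produce separate categories here, which is the kind of problem that data cleaning is meant to find and fix.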

OpenRefine

OpenRefine is an open-source tool that was built to help people clean data. It provides functions that let you investigate your data and then apply fixes to groups of data at the same time. You can also write short scripts to transform columns of data.
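One way to picture "applying a fix to a group of data at the same time" is sketched below in Python. This is an illustration under stated assumptions, not OpenRefine's own code: the helper names `normalization_key`, `group_variants`, and `apply_fix` are hypothetical, the sample names are invented, and the normalization is a simplified stand-in for the kind of matching a cleaning tool might do.

```python
import string
from collections import defaultdict

def normalization_key(value: str) -> str:
    """Simplified key: trim, lowercase, drop punctuation, sort unique words."""
    no_punct = value.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(no_punct.split())))

def group_variants(values):
    """Group raw values that share the same normalization key."""
    groups = defaultdict(list)
    for value in values:
        groups[normalization_key(value)].append(value)
    return dict(groups)

def apply_fix(values, canonical_by_key):
    """Replace every member of a group with the chosen canonical value."""
    return [canonical_by_key.get(normalization_key(v), v) for v in values]

# Hypothetical messy column.
names = ["Smith, John", "John Smith", "john smith ", "Jane Doe"]

groups = group_variants(names)          # investigate: which values belong together?
cleaned = apply_fix(names, {"john smith": "John Smith"})  # fix the whole group at once
print(groups)
print(cleaned)
```

The sketch separates investigating (grouping the variants) from fixing (choosing one value for the whole group), which mirrors the workflow described above.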

Although at first glance OpenRefine may look like spreadsheet software such as Excel, a few key things make it a good data cleaning tool:

Features

Before we get started

Note: OpenRefine is a Java program that runs on your own machine (not in the cloud). Its interface opens in your web browser, but no internet connection is needed.

Follow the Setup instructions to install OpenRefine.

If OpenRefine does not open automatically after you install and start it, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to open the interface.
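If you are unsure whether OpenRefine is actually running, a small check like the following Python sketch can tell you whether anything answers at those addresses. It uses only the Python standard library; the function name `is_reachable` is hypothetical, and a positive answer only shows that some web server is listening on that port, not that it is necessarily OpenRefine.

```python
from urllib.error import URLError
from urllib.request import urlopen

def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP request to `url` gets a successful response."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False

if __name__ == "__main__":
    # The two addresses given in the lesson text.
    for address in ("http://127.0.0.1:3333/", "http://localhost:3333"):
        state = "responding" if is_reachable(address) else "not responding"
        print(address, "is", state)
```

If neither address responds, OpenRefine is probably not running yet; start it again and then reload the page in your browser.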

Getting help for OpenRefine

Key Points

  • OpenRefine is a powerful, free, open-source tool for cleaning data.

  • OpenRefine automatically tracks every step you take, so you can backtrack as needed and keep a record of all the work done.