This lesson is being piloted (Beta version)

OpenRefine for Social Science Data: Glossary

Key Points

  • OpenRefine is a powerful, free, and open-source tool for cleaning data.

  • OpenRefine automatically tracks every step you take, so you can backtrack as needed and keep a record of all work done.

Working with OpenRefine
  • OpenRefine can import a variety of file types.

  • OpenRefine can be used to explore data using filters.

  • Clustering in OpenRefine can help to identify different values that might mean the same thing.

  • OpenRefine can transform the values of a column.
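
To make the clustering point above concrete, here is a minimal Python sketch of a key-collision approach: each value is reduced to a normalized "fingerprint", and values that share a fingerprint are grouped. The function names and sample values are hypothetical, and this is a simplified illustration of the idea, not OpenRefine's own implementation.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so that near-duplicate spellings collide."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, then split into tokens on whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Sort and de-duplicate tokens so word order does not matter.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values by fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Hypothetical sample values with inconsistent data entry.
villages = ["Chirodzo", "chirodzo ", "Chirodzo.", "Ruaca", "Ruaca-Nhamuenda", "ruaca"]
clusters = cluster(villages)
print(clusters)
```

Each group in the result is a set of spellings that probably refer to the same thing; deciding which spelling to keep is still up to the person cleaning the data.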

Filtering and Sorting with OpenRefine
  • OpenRefine provides a way to sort and filter data without affecting the raw data.

  • Common Transforms are a great first step in cleaning up data.

  • You can write simple transforms in GREL (General Refine Expression Language) and reuse them from the History tab.

  • Check whether someone else has already needed to do the same thing as you; you might be able to reuse their GREL.
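
As a rough illustration of what common transforms do to a column, the Python sketch below mimics three of them. The comments name GREL expressions that behave comparably; the Python function names and the sample values are assumptions made for this example, and the raw list is left untouched while a cleaned copy is built.

```python
import re


def trim(value: str) -> str:
    # Comparable to the GREL expression: value.trim()
    return value.strip()


def collapse_whitespace(value: str) -> str:
    # Comparable to the GREL expression: value.replace(/\s+/, " ")
    return re.sub(r"\s+", " ", value)


def to_titlecase(value: str) -> str:
    # Comparable to the GREL expression: value.toTitlecase()
    return value.title()


# Hypothetical column with inconsistent spacing and capitalization.
raw = ["  burnt bricks", "Burnt   Bricks ", "BURNT BRICKS"]

# Apply the transforms in sequence to a copy; `raw` itself is not modified.
cleaned = [to_titlecase(collapse_whitespace(trim(v))) for v in raw]
print(cleaned)
```

After these three steps the inconsistent entries collapse to a single spelling, which is why common transforms are a useful first pass before more targeted cleaning.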

Undo, Redo, and Scripts
  • OpenRefine keeps track of your steps.

  • You can step back to an earlier state, but once you make a new change from there, every step after that point is lost.

  • You can save and apply your steps to other datasets.
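
The points above can be pictured with a small conceptual model in Python: the original data is kept as is, every change is recorded as a step, and the current state is rebuilt by replaying the steps. The `Step` and `Project` classes are hypothetical names for this sketch and do not reflect how OpenRefine stores its history internally.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    description: str
    apply: Callable[[List[str]], List[str]]


@dataclass
class Project:
    original: List[str]
    history: List[Step] = field(default_factory=list)
    position: int = 0  # number of recorded steps currently applied

    def do(self, step: Step) -> None:
        # A new change discards any steps that had been undone.
        self.history = self.history[: self.position]
        self.history.append(step)
        self.position += 1

    def undo_to(self, n: int) -> None:
        # Step back so that only the first n steps are applied.
        self.position = n

    def rows(self) -> List[str]:
        # The original data is never modified; replay the applied steps.
        data = list(self.original)
        for step in self.history[: self.position]:
            data = step.apply(data)
        return data


trim = Step("Trim whitespace", lambda rows: [r.strip() for r in rows])
upper = Step("Uppercase", lambda rows: [r.upper() for r in rows])

project = Project([" a ", "b "])
project.do(trim)
project.do(upper)
project.undo_to(1)  # back to the state after the first step
after_undo = project.rows()

# Reuse the same recorded steps on a different dataset.
other = Project([" x", " y "], history=list(project.history),
                position=len(project.history))
print(after_undo, other.rows())
```

Because the steps are stored separately from the data, the same list of steps can be replayed on another dataset with the same structure.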

Exporting and Saving Data from OpenRefine
  • Cleaned data or entire projects can be exported from OpenRefine.

  • Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.

Other Resources in OpenRefine
  • Other examples and resources online are good for learning more about OpenRefine.


Supported import formats include tab-separated (TSV), comma-separated (CSV), Excel (XLS, XLSX), JSON, XML, RDF as XML, and Google Spreadsheets.

.csv
A file extension indicating that a text file has values separated by commas (comma-separated values).
Clustering
A method for finding groups of different values that may actually represent the same thing.
Faceting
A method for exploring the values in a variable. In this lesson, it is used to explore the values in order to identify errors in data entry.
Filter
To select a subset of data from a dataframe.
.json
A file extension indicating that the values in a text file are structured using JavaScript Object Notation (JSON).
.rdf
A file extension indicating that the values in a file are structured using Resource Description Framework (RDF).
Regular expressions (regex)
A text string that describes a search pattern. Regular expressions usually incorporate wildcards to match letters, numbers, punctuation, spacing, or some combination of these.
.tsv
A file extension indicating that a text file has values separated by tabs (tab-separated values).
.xls
A file extension indicating that a file is a spreadsheet created by Microsoft Excel.
.xlsx
A file extension indicating that a file is a spreadsheet created by Microsoft Excel using XML.
.xml
A file extension indicating that the values in a file are structured using Extensible Markup Language (XML).
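
As a small illustration of the regular expressions entry in this glossary, the Python sketch below searches a few text values for a pattern. The pattern (a standalone four-digit number, read here as a year) and the sample entries are assumptions chosen for the example.

```python
import re

# Hypothetical pattern: exactly four digits bounded by non-word characters.
year_pattern = re.compile(r"\b\d{4}\b")

# Hypothetical free-text entries.
entries = [
    "interviewed 2016-11-17",
    "no date recorded",
    "built in 1998, repaired 2015",
]

# Collect every match found in each entry.
matches = [year_pattern.findall(entry) for entry in entries]
print(matches)
```

Here `\d{4}` matches any four digits and `\b` marks a word boundary, so shorter numbers such as the month and day in the first entry are not matched.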