This lesson is being piloted (Beta version)

OpenRefine for Social Science Data: Glossary

Key Points

Introduction
  • OpenRefine is a powerful, free, open-source tool for data cleaning.

  • OpenRefine automatically tracks every step, allowing you to backtrack as needed and providing a record of all work done.

Working with OpenRefine
  • OpenRefine can import a variety of file types.

  • OpenRefine can be used to explore data using filters.

  • Clustering in OpenRefine can help to identify different values that might mean the same thing.

  • OpenRefine can transform the values of a column.
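
The clustering idea above can be sketched in Python. This is only an illustration of key-collision clustering, loosely modeled on a "fingerprint" approach; it is not OpenRefine's own code, and the function names and sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so that near-identical spellings collide."""
    # Trim, lowercase, and reduce accented characters to plain ASCII.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, then sort the unique whitespace-separated tokens.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


# Hypothetical sample values with inconsistent data entry.
villages = ["Chirodzo", "chirodzo ", "Chirodzo.", "Ruaca", "Ruaca-Nhamuenda"]
clusters = cluster(villages)
print(clusters)
```

Values that differ only in case, spacing, or punctuation end up under the same key, which is what makes them candidates for merging. Deciding whether they really mean the same thing remains a human judgment.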

Filtering and Sorting with OpenRefine
  • OpenRefine provides a way to sort and filter data without affecting the raw data.
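
As a rough Python analogy (not how OpenRefine is implemented), filtering and sorting can be thought of as building a new view of the rows while the raw rows stay untouched. The row data and variable names below are made up for illustration.

```python
from collections import Counter

# Hypothetical raw rows; in OpenRefine these would be the rows of a project.
rows = [
    {"village": "Ruaca", "members": 7},
    {"village": "God", "members": 3},
    {"village": "Ruaca", "members": 12},
    {"village": "Chirodzo", "members": 9},
]

# A facet-like summary: how many rows carry each distinct value.
facet = Counter(row["village"] for row in rows)

# A filter builds a new list (a view) and leaves `rows` untouched.
ruaca_only = [row for row in rows if row["village"] == "Ruaca"]

# Sorting the view returns a new list instead of reordering the raw rows.
ruaca_sorted = sorted(ruaca_only, key=lambda row: row["members"], reverse=True)

print(facet)
print(ruaca_sorted)
```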

Transforms
  • Common Transforms are a great first step in cleaning up data.

  • You can write simple transforms in GREL and reuse them from the History tab.

  • Check whether someone else has needed to do the same thing as you; you might be able to reuse their GREL.
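
To make the idea of a transform concrete, here is a small Python sketch of three cleanup steps similar in spirit to common transforms (trimming whitespace, collapsing repeated whitespace, and title-casing). The helper names and sample values are hypothetical. In GREL itself, trimming and title-casing are written as value.trim() and value.toTitlecase().

```python
import re


def trim(value: str) -> str:
    """Remove leading and trailing whitespace."""
    return value.strip()


def collapse_whitespace(value: str) -> str:
    """Replace each run of whitespace with a single space."""
    return re.sub(r"\s+", " ", value)


def to_titlecase(value: str) -> str:
    """Capitalize the first letter of each word."""
    return value.title()


def clean(value: str) -> str:
    """Apply the three transforms in order."""
    for transform in (trim, collapse_whitespace, to_titlecase):
        value = transform(value)
    return value


# Hypothetical column values with messy spacing and casing.
column = ["  mabati   sloping ", "GRASS", "mabati_sloping"]
cleaned = [clean(value) for value in column]
print(cleaned)
```

Keeping each transform as a small named step makes it easy to reuse the same sequence on another column, much as a saved expression can be reused.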

Undo, Redo, and Scripts
  • OpenRefine keeps track of your steps.

  • You can step back to an earlier state, but if you then make a new change, everything you did after that state is lost.

  • You can save and apply your steps to other datasets.
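
The undo and redo behavior can be pictured with a minimal Python sketch. It assumes a common history model in which stepping back keeps later steps available until a new operation is applied, at which point they are discarded. The History class and its method names are hypothetical and are not part of OpenRefine.

```python
class History:
    """A linear list of states with a pointer to the current one."""

    def __init__(self, initial):
        self._states = [initial]
        self._position = 0

    @property
    def current(self):
        return self._states[self._position]

    def apply(self, operation):
        # A new operation discards any states after the current one.
        del self._states[self._position + 1:]
        self._states.append(operation(self.current))
        self._position += 1

    def step_back(self):
        if self._position > 0:
            self._position -= 1

    def step_forward(self):
        if self._position < len(self._states) - 1:
            self._position += 1


history = History(" Ruaca ")
history.apply(str.strip)   # state 1: "Ruaca"
history.apply(str.upper)   # state 2: "RUACA"

history.step_back()        # back to "Ruaca"; "RUACA" can still be redone
history.step_forward()     # redo: "RUACA" again
history.step_back()        # back to "Ruaca" once more

history.apply(str.lower)   # new step: "RUACA" is discarded for good
history.step_forward()     # nothing left to redo
print(history.current)
```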

Exporting and Saving Data from OpenRefine
  • Cleaned data or entire projects can be exported from OpenRefine.

  • Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.

Other Resources in OpenRefine
  • Other examples and resources online are good for learning more about OpenRefine.

Glossary

OpenRefine can import a variety of file types, including tab-separated (tsv), comma-separated (csv), Excel (xls, xlsx), JSON, XML, RDF as XML, and Google Spreadsheets.

csv
A file extension indicating that a text file has values separated by commas (comma-separated values).
Clustering
A method for finding groups of different values that may actually represent the same thing.
Faceting
A method for exploring the values in a variable. In this lesson it is used to explore values in order to identify data entry errors.
Filter
To select a subset of data from a dataset.
JSON
A file extension indicating that the values in a text file are structured using JavaScript Object Notation (JSON).
RDF
A file extension indicating that the values in a file are structured using Resource Description Framework (RDF).
Regular expressions (regex)
Text strings that describe a search pattern. They usually use wildcards to match letters, numbers, punctuation, spacing, or some combination of these.
tsv
A file extension indicating that a text file has values separated by tabs (tab-separated values).
xls
A file extension indicating that a file is a spreadsheet created by Microsoft Excel.
xlsx
A file extension indicating that a file is a spreadsheet created by Microsoft Excel using XML.
XML
A file extension indicating that the values in a file are structured using Extensible Markup Language (XML).
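
As a short illustration of the regular expression entry above, the following Python sketch uses a pattern to pick out values shaped like ISO dates (four digits, two digits, two digits, separated by hyphens). The sample values are made up, and regular expression syntax can differ slightly between tools.

```python
import re

# \d is a wildcard for any digit; {4} and {2} say how many to expect.
iso_date = re.compile(r"\d{4}-\d{2}-\d{2}")

# Hypothetical values from a free-text date column.
values = ["2016-11-17", "17/11/2016", "2017-04-02", "April 2017"]

# fullmatch requires the whole value to fit the pattern.
matches = [value for value in values if iso_date.fullmatch(value)]
print(matches)
```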