Filtering and Sorting with OpenRefine
OverviewTeaching: 10 min
Exercises: 10 minQuestions
How can we select only a subset of our data to work with?
How can we sort our data?Objectives
Filter to a subset of rows by text filter or include/exclude.
Sort table by a column.
Sort by multiple columns.
Exploring data by creating facets and applying filters
Facets are one of the most useful features of OpenRefine and can help you both get an overview of the data in a project as well as helping you bring more consistency to the data. OpenRefine supports faceted browsing as a mechanism for
- seeing a big picture of your data, and
- filtering down to just the subset of rows that you want to change in bulk.
A ‘Facet’ groups all the like values that appear in a column and then allows you to filter the data by these values. You can also edit values across many records at the same time.
There are different types of facets for different data types. We are going to start with a ‘Text facet’. This groups all the identical text values in a column and lists the unique values with the number of records it appears in. The facet information always appears in the left hand panel in the OpenRefine interface. If you have numbers or dates in that they will show up in the facet in a group called (number) or (date). If you want a number or date to appear as text in this kind of facet, you have to use a custom text facet and add .toString() after value in the Expression.
Here we will use faceting to look for potential errors in data entry in the
- Scroll over to the
- Click the down arrow and choose
- In the left panel, you’ll now see a box containing every unique value in the
villagecolumn along with a number representing how many times that value occurs in the column.
- Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
- Hover the mouse over one of the names in the
Facetlist. You should see that you have an
- You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we’ll use instead. We’ll learn about these when we talk about clustering.
What errors do you see that we will need to clean up? We will talk about them as a group.
Chirdozois likely a mis-entry of
Rucais likely a mis-entry of
Ruaca - Nhamuendaand
Ruaca-Nhamuendarefer to the same place (differ only by spaces around the hyphen). You might also wonder if both of these are the same as
Ruaca. We will see how to correct these misspelled and mistyped entries in a later exercise.
- The entry
49is almost certainly an error but you will not be able to fix it by reference to other data.
There are many entries in our data table. We can filter it to work on a subset of the data in the list for the next set of operations. Please ensure you perform this step to save time during the class.
- Click the down arrow next to
Text filter. A
respondent_roof_typefacet will appear on the left margin.
- Type in
mabatand press return. There are 58 matching rows of the original 131 rows (and these rows are selected for the subsequent steps).
- At the top, change the view to
rows. This way you will see most of the matching rows.
- What roof types are selected by this procedure?
- How would you restrict this to only one of the roof types?
Text faceton the
respondent_roof_typecolumn after filtering. This will show that two names match your filter criteria. They are
- To restrict to only one of these two roof types, you could include more letters in your filter.
In addition to the simple text filtering we used above, another way to narrow our filter is to
exclude entries in a facet. You will see the
exclude options if you hover over the name in the facet window.
If you still have your facet for
respondent_roof_type, you can use it, or use drop-down menu >
Text facet to create a new facet. Only the entries with names that agree with your
Text filter will be included in this facet.
Faceting and filtering look very similar. A good distinction is that faceting gives you an overview description of all of the data that is currently selected, while filtering allows you to select a subset of your data for analysis.
include / excludeto select only entries from one of these two roof types.
- In the facet (left margin), click on one of the names, such as
mabatisloping. Notice that when you click on the name, or hover over it, there are entries to the right for
include. This will explicitly include this roof type, and exclude others that are not explicitly included. Notice that the option now changes to
excludeon the other roof type and notice how the two entries appear and disappear from the table.
Remove the filter before moving on so that you again have the full dataset of 131 records.
More on Facets
As well as ‘Text facets’ Refine also supports a range of other types of facet. These include:
- Numeric facets
- Timeline facets (for dates)
- Custom facets
- Scatterplot facets
Numeric and Scatterplot facets display graphs instead of lists of values. The numeric facet graph includes ‘drag and drop’ controls you can use to set a start and end range to filter the data displayed. These facets are explored further in Examining Numbers in OpenRefine
Custom facets are a range of different types of facets. Some of the default custom facets are:
- Word facet - this breaks down text into words and counts the number of records each word appears in
- Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row
- Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column. This can be useful for spotting incorrect or unusual data in a field where specific lengths are expected (e.g. if the values are expected to be years, any row with a text length more than 4 for that column is likely to be incorrect)
- Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.
Using clustering to detect possible typing errors
In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings
New York and
new york are very likely to refer to the same concept and just have capitalization differences. Likewise,
Godel probably refer to the same person. Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in. Experiment with them, and learn more about these algorithms and how they work.
- In the
villageText Facet we created in the step above, click the
- In the resulting pop-up window, you can change the
Keying Function. Try different combinations to see what different mergers of values are suggested.
- Select the
key collisionmethod and
metaphone3keying function. It should identify two clusters.
- Click the
Merge?box beside each cluster, then click
Merge Selected and Reclusterto apply the corrections to the dataset.
- Try selecting different
Keying Functionsagain, to see what new merges are suggested.
- You should find no more clusters are found. None of the available methods offered to cluster
Chirodzo. To merge these values we need to hover over them in the village text facet, select edit, and manually change the names.
Ruaca. You should now have four clusters:
Important: If you
Merge using a different method or keying function, or more times than described in the instructions above,
your solutions for later exercises will not be the same as shown in those exercise solutions.
Different clustering algorithms
The technical details of how the different clustering algorithm work can be found at the link below.
OpenRefine provides a way to sort and filter data without affecting the raw data.