NTEN sample and existing data
NTEN has given me some example data to clean up for a mock import into their constituent database. The data are five files:
- new_email_import_files.tar from a service provider, which need to be imported into the NTEN database
- One existing_email_list of email addresses already in the NTEN database
The import data have several inconsistencies, and need to be cross-referenced against the existing NTEN data to avoid importing duplicates.
LibreOffice and Open Refine (Google Refine)
The steps in this section are demonstrated using LibreOffice. However, any spreadsheet software will work similarly.
Inconsistent headers and data structure
A quick overview of the data structure indicates that it is necessary to re-arrange the columns and normalize the column headers.
The data are merged into a single CSV file (NewData-Merged-ImportColumnsOnly), with consistent headers. The merged CSV also contains only necessary columns, as not all of the columns from the original data are to be imported.
A quick, visual scan of the ’email’ column indicates that there is leading whitespace, and possibly trailing whitespace, that will be removed.
Comparison data – existing email list
The existing email list contains 40,256 entries. These entries need to be scanned to see any of the import data match. The column header ’email’ is added for consistency.
The following sections describe steps to clean and refine the data using Open (Google) Refine. It is assumed that OpenRefine is downloaded on your local computer, if you want to follow the steps.
Create projects for current email list and new data import
When you start Open (Google) Refine, you are presented with the Create Project screen. The screen reads:
Create a project by importing data. What kinds of data files can I import? TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents are all supported. Support for other formats can be added with Google Refine extensions.
We will create two projects, so this process will be repeated. Be sure to keep track of the project names, as we will need to cross-reference data from one to the other later on. My project titles are as follows:
On the Create Project page, click the Browse button next to Get Data From: This Computer
- Select the file NewData-Merged-ImportColumnsOnly.csv
- Click Open
- Click Next
At the project settings screen,
- Give the project a title, such as nten_new_data_import
- Verify all of the settings match the CSV fie (or other data)
- Click Create Project
To create a new project, click the Google Refine logo.
For the nten_existing_email_list project, repeat the previous steps, using the file existing_email_list.csv
Your project should look similar to the following picture.
Clean-up and refine nten_new_data_import project
We will first clean up the email addresses with a common text transformation:
- Trim leading and trailing whitespace – removes spaces and tabs from either end of a field value.
To access the Trim leading and trailing whitespace function:
- Click the down arrow in the ’email’ column
- Hover over Edit Cells
- Hover over Common Transformations
- Click Trim leading and trailing whitespace
You should see a yellow indicator box describing the action and how many cells were affected.
Remove rows with duplicate email addresses, keeping most detailed row
There are several rows with duplicate email addresses. To do this, we will sort the email column alphabetically, and show only rows with duplicate entries.
Sort the email column
- Click the down arrow on the ’email’ column header
- Hover over Sort
- Click “Sort…” (in the sub-menu)
You should see the following dialogue.
Click OK to sort the column.
Note: Make sure “Show as: ” is set to Rows, not Records
Facet the email column by ‘duplicates’
We will now show only rows that contain duplicate emails, within the import data. We will keep the most detailed record, flagging the other records for later faceting.
- Click the down arrow in the email column header
- Hover over Facet
- Hover over Customized Facets
- Click Duplicates Facet
- In the Facet/Filter column you should see an email box appear
- Click true
Flag duplicates with insufficient details, leaving one of each duplicate
- Scan through the duplicate rows clicking the Flag icon next to rows with few details
- Leave at least one row for each duplicate email, this row should have the most details compared to other duplicate rows
- Note: Make sure “Show as: ” is set to Rows, not Records
- Note: You may want to set “Show: ” to a higher value, such as 25
Remove the duplicate email facet, add a facet for flagged rows
Now that we have flagged the duplicate rows with insufficient details, we can remove the duplicates facet.
- Click the X next to the email facet title, in the left sidebar
- Observe that the flags are still in place
- Click the down arrow in the All column header
- Hover over Facet
- Click Facet by flag
We now want to show only rows without a flag.
- Locate the Flagged rows box under Facet / Filter on the left hand sidebar
- Click false – this shows only rows that haven’t been flagged
Flag existing emails based on email list
OK, so we’ve narrowed down our data, and cleaned it up a bit. Lets go ahead and filter out rows that match existing email addresses in the nten_existing_emails project. We will be using a bit of programming here, but it is just cut-and-paste.
The small script we will use calls on three functions
- cross() – cross references a column in another project. We will use this to check if an email address exists in the nten_existing_emails project
- length() – checks the length of something. This can be a list of things, such as cells in a row. If a row has three cells (e.g. first_name, last_name, email) the length of the row will be three.
- if() – checks if a statement is true or false. If the statement is true, something happens. If false, something different happens.
Each function has a pair of parenthesis. These parenthesis accept stuff, like a coin slot. When you put a coin in (calling the function), the function does something, and gives something back (returns something) to be used later in the program. For example, I put a coin in the slot and, hopefully, receive a green gumball. For what it’s worth, some functions work without any coin being inserted, you can just ‘bump’ them while some functions take several coins to get started.
Add a new column based on email column and cross-reference with nten_existing_emails project
We will create a new column with a value of ‘New’ or ‘Existing’ for each row. Email addresses that aren’t in the nten_existing_emails project, will have a value of ‘New’ in this column. Email addresses that are in the nten_existing_emails project, will have a value of ‘Existing’ in this column.
- Click the down arrow in the email column header
- Hover over Edit column
- Click Add new column based on this column
In the dialogue box that poops up,
- Fill in the New column name with the value “new_or_existing”
- Paste the following code into the Expression text area
- Click OK
if ( length ( cross( cell ,'nten_existing_email_list' ,'email' ) ) > 0 , "Duplicate" , "New" )
We should now have a column indicating whether email addresses are new or already exist in the NTEN database.
Facet based on new_or_existing column to show only new emails
We can now narrow down our dataset to show only new emails.
- Click on the down arrow in the new_or_existing column header
- Hover over Facet
- Click Text facet
A new facet will appear in the Facets / Filters sidebar.
- Locate the new_or_existing facet in the Facets / Filters sidebar
- Click new
Export the data
Now that we have successfully cleaned up the data and narrowed it down to only new email records, we can export!
- In the upper right hand corner, click the Export button
- Click to select the export format you desire (in this case, I’m selecting CSV)
This data (nten_new_data_import-refined.csv) can now be used for further analysis, import into a database, etc.
The following is the cross-reference code, with inline comments.
// Brief explanatory notes if ( // Check a condition and perform one of two actions length( // Look at the length of the cross reference row cross( // Cross reference another project, return existing row cell // using the value of each cell ,'nten_existing_email_list' // looking in the 'nten_email_list' project ,'email' // 'nten_email_list' email column ) ) > 0 // if the length of the cross reference row is greater than zero , "Duplicate" // The cross referenced email is a duplicate , "New" // The cross-referenced email is new )