Greeted by the setting sun.
Well… I missed my return flight to Oregon after the Nonprofit Technology Conference. I went to Dulles International instead of Reagan National.
While waiting for the bus back to D.C., I met a man from Ethiopia. We traveled together on the bus and light rail to meet his cousin. The three of us went to dinner at an authentic Ethiopian restaurant in Maryland.
After dinner, I checked in to the Hilltop Hostel.
Route ten to Union Station
morning smog haze
empty bus (720)
single occupancy vehicles
The World Without Us
Firefox OS is a mobile operating system built on web standard technologies.
To get started with Firefox OS, I recommend the following book:
Quick Guide For Firefox OS App Development by Andre Garzia.
“Learn how easy and quick it is to develop applications for Firefox OS, the new mobile operating system by Mozilla. Empowered by this book's practical approach you will learn through examples how to develop apps from the beginning all the way to distribution in the Firefox Marketplace.”
You can download the book for free, or choose to pay the author a reasonable sum for his work.
Form spam is common on public websites. Spam bots have been trained to fill out any form they encounter, targeting common fields such as ‘email’, ‘name’, etc.
There are several approaches to dealing with form spam. A common approach is to put up hard-to-read words on the form, called a CAPTCHA, and have the user type the letters in a field to prove they are human. This approach is losing its usefulness, as spam bots are sometimes able to detect the letters correctly. It also punishes humans, who have to strain to read the letters. This detracts from usability and diminishes accessibility.
An alternative approach places a hidden field on the form that only spam bots can see. When a form is submitted with this field filled in, the software automatically knows it is junk. This field is called a ‘honeypot’. The honeypot technique can be combined with other heuristic tests to determine spam-bottiness, such as a threshold on the time taken to fill out a form. For example, the average human may take 30 to 60 seconds to fill out a form, while a spam bot can spam the form dozens of times a minute. If the system detects frequent, or very fast, submissions, it can flag them as spam.
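The two heuristics above can be sketched in a few lines of Python. The field name and time threshold below are illustrative assumptions, not from any particular library:

```python
HONEYPOT_FIELD = "website_url"  # hypothetical hidden field; humans never see it
MIN_SECONDS = 5                 # assumed threshold; humans rarely submit faster

def is_spam(form_data, render_time, submit_time):
    """Flag a submission as spam if the honeypot field was filled in,
    or if the form was completed implausibly fast."""
    if form_data.get(HONEYPOT_FIELD):
        return True  # only bots fill the hidden field
    if submit_time - render_time < MIN_SECONDS:
        return True  # faster than a human could plausibly type
    return False
```

A real implementation would also track submission frequency per IP, but the core idea is just these two checks.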
Drupal has a project, called Honeypot, that provides a non-invasive approach to spam detection using the methods outlined above.
NTEN has given me some example data to clean up for a mock import into their constituent database. The data are five files:
The import data have several inconsistencies, and need to be cross-referenced against the existing NTEN data to avoid importing duplicates.
The steps in this section are demonstrated using LibreOffice. However, any spreadsheet software will work similarly.
A quick overview of the data structure indicates that it is necessary to rearrange the columns and normalize the column headers.
The data are merged into a single CSV file (NewData-Merged-ImportColumnsOnly), with consistent headers. The merged CSV also contains only necessary columns, as not all of the columns from the original data are to be imported.
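A merge like this can also be scripted. The sketch below reads several CSV files, keeps only the wanted columns, normalizes the headers, and writes one merged file. The file and column names here are assumptions for illustration, not the actual NTEN files:

```python
import csv

WANTED = ["first_name", "last_name", "email"]  # hypothetical import columns

def merge_csvs(in_paths, out_path):
    """Merge several CSV files into one, keeping only the WANTED columns."""
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=WANTED)
        writer.writeheader()
        for path in in_paths:
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    # Normalize headers so 'Email' and 'email' match
                    row = {k.lower().strip(): v for k, v in row.items()}
                    writer.writerow({col: row.get(col, "") for col in WANTED})
```

Columns missing from a source file are simply left empty in the merged output.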
A quick visual scan of the ‘email’ column indicates leading, and possibly trailing, whitespace that will need to be removed.
The existing email list contains 40,256 entries. These entries need to be scanned to see if any of the import data match. The column header ‘email’ is added for consistency.
The following sections describe steps to clean and refine the data using Open (Google) Refine. It is assumed that OpenRefine is installed on your local computer if you want to follow along.
When you start Open (Google) Refine, you are presented with the Create Project screen. The screen reads:
Create a project by importing data. What kinds of data files can I import? TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents are all supported. Support for other formats can be added with Google Refine extensions.
We will create two projects, so this process will be repeated. Be sure to keep track of the project names, as we will need to cross-reference data from one to the other later on. My project titles are as follows:
On the Create Project page, click the Browse button next to Get Data From: This Computer
At the project settings screen,
To create a new project, click the Google Refine logo.
For the nten_existing_email_list project, repeat the previous steps, using the file existing_email_list.csv
Your project should look similar to the following picture.
We will first clean up the email addresses with a common text transformation:
To access the Trim leading and trailing whitespace function:
You should see a yellow indicator box describing the action and how many cells were affected.
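For reference, the transform is equivalent to a plain string strip. A Python sketch of what happens to each cell in the column (the row and column shapes here are my own illustration):

```python
def trim_cells(rows, column):
    """Strip leading and trailing whitespace from one column,
    mirroring Refine's 'Trim leading and trailing whitespace'."""
    for row in rows:
        value = row.get(column)
        if value is not None:
            row[column] = value.strip()
    return rows
```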
There are several rows with duplicate email addresses. To find them, we will sort the email column alphabetically and show only rows with duplicate entries.
You should see the following dialogue.
Click OK to sort the column.
Note: Make sure “Show as: ” is set to Rows, not Records
We will now show only rows that contain duplicate emails, within the import data. We will keep the most detailed record, flagging the other records for later faceting.
Now that we have flagged the duplicate rows with insufficient details, we can remove the duplicates facet.
We now want to show only rows without a flag.
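The flag-and-filter step above can be sketched in Python. Here “most detailed” simply means the row with the most non-empty fields, which is an assumption about how you would rank duplicate records:

```python
from collections import defaultdict

def flag_duplicates(rows, key="email"):
    """Keep the most detailed row per email address (the one with the
    most non-empty fields); flag the rest and filter them out."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get(key, "")].append(row)
    for dupes in groups.values():
        # Rank by how many fields are filled in; keep the best one
        dupes.sort(key=lambda r: sum(1 for v in r.values() if v), reverse=True)
        for row in dupes[1:]:
            row["flagged"] = True
    return [r for r in rows if not r.get("flagged")]
```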
OK, so we’ve narrowed down our data and cleaned it up a bit. Let’s go ahead and filter out rows that match existing email addresses in the nten_existing_email_list project. We will be using a bit of programming here, but it is just cut-and-paste.
The small script we will use calls on three functions:
Each function has a pair of parentheses. These parentheses accept stuff, like a coin slot. When you put a coin in (calling the function), the function does something and gives something back (returns something) to be used later in the program. For example, I put a coin in the slot and, hopefully, receive a green gumball. For what it’s worth, some functions work without any coin being inserted (you can just ‘bump’ them), while some functions take several coins to get started.
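The analogy maps straight onto code. These three hypothetical functions of my own show the three cases: bump() takes no ‘coins’, gumball() takes one, and combo() takes several:

```python
def bump():
    """No arguments needed; just 'bump' the machine."""
    return "free gumball"

def gumball(coin):
    """Insert one coin and, hopefully, get a green gumball back."""
    return "green gumball" if coin == 25 else "nothing"

def combo(coin_a, coin_b):
    """Some functions need several coins before they do anything."""
    return coin_a + coin_b
```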
We will create a new column with a value of ‘New’ or ‘Duplicate’ for each row. Email addresses that aren’t in the nten_existing_email_list project will have a value of ‘New’ in this column; email addresses that are in the nten_existing_email_list project will have a value of ‘Duplicate’.
In the dialogue box that pops up,
if(length(cross(cell, 'nten_existing_email_list', 'email')) > 0, "Duplicate", "New")
We should now have a column indicating whether email addresses are new or already exist in the NTEN database.
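Outside of Refine, the same cross-reference boils down to a set-membership test. A Python sketch of the equivalent logic, with made-up data shapes:

```python
def label_rows(import_rows, existing_emails):
    """Mark each import row 'Duplicate' if its email already exists,
    otherwise 'New' -- the same logic as the GREL cross() expression."""
    existing = set(existing_emails)
    for row in import_rows:
        row["status"] = "Duplicate" if row.get("email") in existing else "New"
    return import_rows
```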
We can now narrow down our dataset to show only new emails.
A new facet will appear in the Facets / Filters sidebar.
Now that we have successfully cleaned up the data and narrowed it down to only new email records, we can export!
This data (nten_new_data_import-refined.csv) can now be used for further analysis, import into a database, etc.
The following is the cross-reference code, with inline comments.
// Brief explanatory notes
if (                                // Check a condition and perform one of two actions
  length (                          // Look at the length of the cross reference result
    cross (                         // Cross reference another project, return matching rows
      cell                          // using the value of each cell
      ,'nten_existing_email_list'   // looking in the 'nten_existing_email_list' project
      ,'email'                      // in that project's 'email' column
    )
  ) > 0                             // if the length of the cross reference result is greater than zero
  , "Duplicate"                     // the cross-referenced email is a duplicate
  , "New"                           // the cross-referenced email is new
)
Grid alignment is a tool to help users discover content quickly. A well-designed grid helps us scan content for meaningful patterns and identify important elements.
Grids are composed of two primary elements:
I am working on configuring the Mail Exchange settings for PacificyearlyMeeting.org, and am tracking things like DNS reverse lookup, and other necessary configuration steps. The number of considerations is quite confusing, but I have found a great tool that checks for multiple problems and explains what each test means. The tool is called the MX Toolbox Super Tool, and it has been really informative.
To use the MX Toolbox Super Tool, simply enter the IP address or domain name of the server in question and let the tool scan for misconfigurations, blacklisting, etc. When something seems confusing or unknown, click the link next to the error or warning for an explanation. This tool has been educational!