D.C.

Well… I missed my return flight to Oregon, after the Nonprofit Technology Conference. I went to Dulles International instead of Regan.

While waiting for the bus back to D.C., I met a man from Ethiopia. We traveled together on the bus and light rail to meet his cousin. The three of us went to dinner at an authentic Ethopian restaurant in Maryland.

After dinner, I checked in to the Hilltop Hostel.

Quick Guide For Firefox OS App Development

If you are interested in producing applications for the mobile web, please consider building using native technologies such as HTML, CSS, and JavaScript.

Firefox OS is a mobile operating system built on web standard technologies.

To get started with Firefox OS, I recommend the following book:

Developing for Firefox OS - Quick Guide

Developing for Firefox OS – Quick Guide

Quick Guide For Firefox OS App Development by Andre Garzia.

“Learn how easy and quick it is to develop applications for Firefox OS, the new mobile operating system by Mozilla. Empowered by this books practical approach you will learn thru examples how to develop apps from the beginning all the way to the distribution in the Firefox Markeplace.”

You can download the book for free, or choose to pay the author a reasonable sum for his work :-)

Human friendly SPAM prevention using Honeypot and timestamps

Form SPAM is common with pubic websites. SPAM bots have been trained to fill out any form they encounter, with common fields such as ‘email’, ‘name’, etc..

There are several approaches to dealing with form spam. A common approach is to put up hard-to-read words on the form, called CAPTCHA, and have the user type the letters in a field to prove they are human. This approach is losing its  usefulness, as SPAM bots are able to correctly detect the letters sometimes. It also punishes the humans, who have to strain to read the letters. This detracts usability diminishes accessibility.

An alternative approach places a hidden field on the form that only SPAM bots can see. When a form is submitted with this field, the software automatically knows it is junk. This field is called a ‘honeypot’. The honeypot technique can be combined with other heuristic tests to determine spam-botti-ness, such as a threshold of time taken to fill out a form. For example, the average human may take 30-60 seconds to fill out a form, while a SPAM bot can SPAM the form dozens of times a minute. If the system detects frequent, or very fast, submissions, it can flag the submission as SPAM.

Drupal has a project, called Honeypot, that provides a non-invasive approach to SPAM detection, using the methods outlined above.
https://drupal.org/project/honeypot

NTEN Data Cleanup using LibreOffice and Open (Google) Refine

NTEN sample and existing data

NTEN has given me some example data to clean up for a mock import into their constituent database. The data are five files:

The import data have several inconsistencies, and need to be cross-referenced against the existing NTEN data to avoid importing duplicates.

LibreOffice and Open Refine (Google Refine)

For this project, LibreOffice and Open (Google) Refine are used. They are both freely available, and are part of our commonwealth.

LibreOffice productivity suite

LibreOffice productivity suite

OpenRefine - A free, open source power tool for working with messy data.

OpenRefine – A free, open source power tool for working with messy data.

LibreOffice

The steps in this section are demonstrated using LibreOffice. However, any spreadsheet software will work similarly.

Inconsistent headers and data structure

A quick overview of the data structure indicates that it is necessary to re-arrange the columns and normalize the column headers.

Data structure - inconsistent capitalization and wording in column headers. Missing columns in one or more data sets.

Data structure – inconsistent capitalization and wording in column headers. Missing columns in one or more data sets.

The data are merged into a single CSV file (NewData-Merged-ImportColumnsOnly), with consistent headers. The merged CSV also contains only necessary columns, as not all of the columns from the original data are to be imported.

A quick, visual scan of the ‘email’ column indicates that there is leading whitespace, and possibly trailing whitespace, that will be removed.

Import data merged into single CSV. Column headers normalized (spaces removed, lowercase). Removed unnecessary columns.

Import data merged into single CSV. Column headers normalized (spaces removed, lowercase). Remove unnecessary columns.

Comparison data – existing email list

The existing email list contains 40,256 entries. These entries need to be scanned to see any of the import data match. The column header ‘email’ is added for consistency.

Original email list containing 40,256 entries. These entries will need to be compared against the import data to ensure that no duplicate data is imported.

Original email list contains 40,256 entries. These entries need to be compared against the import data to ensure that no duplicate data is imported.

OpenRefine

The following sections describe steps to clean and refine the data using Open (Google) Refine. It is assumed that OpenRefine is downloaded on your local computer, if you want to follow the steps.

Create projects for current email list and new data import

When you start Open (Google) Refine, you are presented with the Create Project screen. The screen reads:

Create a project by importing data. What kinds of data files can I import? TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents are all supported. Support for other formats can be added with Google Refine extensions.

We will create two projects, so this process will be repeated. Be sure to keep track of the  project names, as we will need to cross-reference data from one to the other later on. My project titles are as follows:

  • nten_new_data_import
  • nten_existing_email_list

nten_new_data_import project

On the Create Project page, click the Browse button next to Get Data From: This Computer

OpenRefine Create Project screen.

OpenRefine Create Project screen.

  1. Select the file NewData-Merged-ImportColumnsOnly.csv
  2. Click Open
  3. Click Next
Settings for the nten_new_data_import project.

Settings for the nten_new_data_import project.

At the project settings screen,

  1. Give the project a title, such as nten_new_data_import
  2. Verify all of the settings match the CSV fie (or other data)
  3. Click Create Project

 nten_existing_email_list project

To create a new project, click the Google Refine logo.

For the nten_existing_email_list project, repeat the previous steps, using the file existing_email_list.csv

Your project should look similar to the following picture.

existing_email_list project page.

existing_email_list project page.

 Clean-up and refine nten_new_data_import project

We will first clean up the email addresses with a common text transformation:

  • Trim leading and trailing whitespace – removes spaces and tabs from either end of a field value.
Trim leading and trailing whitespace for the 'email' column

Trim leading and trailing whitespace for the ‘email’ column

To access the Trim leading and trailing whitespace function:

  1. Click the down arrow in the ‘email’ column
  2. Hover over Edit Cells
  3. Hover over Common Transformations
  4. Click Trim leading and trailing whitespace

You should see a yellow indicator box describing the action and how many cells were affected.

Remove rows with duplicate email addresses, keeping most detailed row

There are several rows with duplicate email addresses. To do this, we will sort the email column alphabetically, and show only rows with duplicate entries.

Sort the email column

Email column - drop-down menu - sort menu.

Email column – drop-down menu – sort menu.

  1. Click the down arrow on the ‘email’ column header
  2. Hover over Sort
  3. Click “Sort…” (in the sub-menu)

You should see the following dialogue.

Sort dialogue for the email column.

Sort dialogue for the email column.

Click OK to sort the column.

Note: Make sure “Show as: ” is set to Rows, not Records

Facet the email column by ‘duplicates’

We will now show only rows that contain duplicate emails, within the import data. We will keep the most detailed record, flagging the other records for later faceting.

Duplicates facet from the email drop-down menu.

Duplicates facet from the email drop-down menu.

  1. Click the down arrow in the email column header
  2. Hover over Facet
  3. Hover over Customized Facets
  4. Click Duplicates Facet
  5. In the Facet/Filter column you should see an email box  appear
  6. Click true

Flag duplicates with insufficient details, leaving one of each duplicate

  1. Scan through the duplicate rows clicking the Flag icon next to rows with few details
  2. Leave at least one row for each duplicate email, this row should have the most details compared to other duplicate rows
  • Note: Make sure “Show as: ” is set to Rows, not Records
  • Note: You may want to set “Show: ” to a higher value, such as 25
Flagged duplicates - 4 of 12 were unique, with relatively good details

Flagged duplicates – 4 of 12 were unique, with relatively good details

Remove the duplicate email facet, add a facet for flagged rows

Now that we have flagged the duplicate rows with insufficient details, we can remove the duplicates facet.

  1. Click the X next to the email facet title, in the left sidebar
  2. Observe that the flags are still in place
  3. Click the down arrow in the All column header
  4. Hover over Facet
  5. Click Facet by flag
Drop-down menu for All column showing "Facet by flag"

Drop-down menu for All column showing “Facet by flag”

We now want to show only rows without a flag.

  1. Locate the Flagged rows box under Facet / Filter on the left hand sidebar
  2. Click false – this shows only rows that haven’t been flagged
Import data faceted for rows that have not been flagged.

Import data faceted for rows that have not been flagged.

Flag existing emails based on email list

OK, so we’ve narrowed down our data, and cleaned it up a bit. Lets go ahead and filter out rows that match existing email addresses in the nten_existing_emails project. We will be using a bit of programming here, but it is just cut-and-paste.

The small script we will use calls on three functions

  • cross() – cross references a column in another project. We will use this to check if an email address exists in the nten_existing_emails project
  • length() – checks the length of something. This can be a list of things, such as cells in a row. If a row has three cells (e.g. first_name, last_name, email) the length of the row will be three.
  • if() – checks if a statement is true or false. If the statement is true, something happens. If false, something different happens.

Each function has a pair of parenthesis. These parenthesis accept stuff, like a coin slot. When you put a coin in (calling the function), the function does something, and gives something back (returns something) to be used later in the program. For example, I put a coin in the slot and, hopefully, receive a green gumball. For  what it’s worth, some functions work without any coin being inserted, you can just ‘bump’ them while some functions take several coins to get started.

Add a new column based on email column and cross-reference with nten_existing_emails project

We will create a new column with a value of ‘New’ or ‘Existing’ for each row. Email addresses that aren’t in the nten_existing_emails project, will have a value of ‘New’ in this column. Email addresses that are in the nten_existing_emails project, will have a value of ‘Existing’ in this column.

Add a new column based on the email column.

Add a new column based on the email column.

  1. Click the down arrow in the email column header
  2. Hover over Edit column
  3. Click Add new column based on this column

In the dialogue box that poops up,

  1. Fill in the New column name with the value “new_or_existing”
  2. Paste the following code into the Expression text area
  3. Click OK
if (
  length (
    cross(
      cell
      ,'nten_existing_email_list'
      ,'email'
    )
  ) > 0
  , "Duplicate"
  , "New"
)

We should now have a column indicating whether email addresses are new or already exist in the NTEN database.

Facet based on new_or_existing column to show only new emails

We can now narrow down our dataset to show only new emails.

Text facet based on new_or_existing column.

Text facet based on new_or_existing column.

  1. Click on the down arrow in the new_or_existing column header
  2. Hover over Facet
  3. Click Text facet

A new facet will appear in the Facets / Filters sidebar.

  1. Locate the new_or_existing facet in the Facets / Filters sidebar
  2. Click new
Import data showing only new emails, thanks to the new_or_existing column.

Import data showing only new emails, thanks to the new_or_existing column.

 

Export the data

Now that we have successfully cleaned up the data and narrowed it down to only new email records, we can export!

  1. In the upper right hand corner, click the Export button
  2. Click to select the export format you desire (in this case, I’m selecting CSV)
Exporting the refined data as CSV

Exporting the refined data as CSV

This data (nten_new_data_import-refined.csv) can now be used for further analysis, import into a database, etc.

Cheers!

The following is the cross-reference code, with inline comments.

// Brief explanatory notes
if (    // Check a condition and perform one of two actions
  length(    // Look at the length of the cross reference row
    cross(    // Cross reference another project, return existing row
      cell    // using the value of each cell
      ,'nten_existing_email_list'    // looking in the 'nten_email_list' project
      ,'email'    // 'nten_email_list' email column
    )
  ) > 0    // if the length of the cross reference row is greater than zero
  , "Duplicate"    // The cross referenced email is a duplicate
  , "New"    // The cross-referenced email is new
)

Grid Design Notes

Grid alignment is  a tool to help users discover content quickly. A well designed grid helps us scan content for meaningful patterns, and identify important elements.

Grids are comprised of two primary elements:

  • columns – vertical divisions of the layout
  • gutters – space between columns

MX, DNS, Blacklist SuperTool

I am working on configuring the Mail Exchange settings for PacificyearlyMeeting.org, and am tracking things like DNS reverse lookup, and other necessary configuration steps. The number of considerations is quite confusing, but I have found a great tool that checks for multiple problems and explains what each test means. The tool is called the MX Toolbox Super Tool, and it has been really informative.

To use the MX ToolBox Super Tool, simply enter the IP address or domain name of the server in question and let the tool scan for mis-configurations, blacklisting, etc. When something seems confusing or unknown, click the link next to the error or warning for an explanation. This tool has been educational!