
Getting Started with GraphLab and SFrames

GraphLab is a Python library that offers many features out of the box, and it is a great library for learning the foundations of machine learning. Many courses teach several algorithms with a bunch of different tools and examples that are far from the real world. If you are new to machine learning, GraphLab (powered by Dato) is a great library to start with.

In this post I’ll try to give some intuition about SFrames and show some simple data visualization examples using IPython notebooks.

First, go ahead and download GraphLab Create from https://dato.com/products/create/ . If you are a student, you can use GraphLab Create for one year at no charge for academic purposes: https://dato.com/download/academic.html . After downloading and installing GraphLab Create, launch an IPython notebook. Here is the simple data set that I’ll use for the rest of this post: people-example.csv

# Import the GraphLab library
import graphlab

# Then load the CSV file into an SFrame. This variable will hold our sample data.
sf = graphlab.SFrame('people-example.csv')

You will see output very similar to the following:

Finished parsing file /Users/muhammetergenc/people-example.csv
Parsing completed. Parsed 7 lines in 0.055689 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/muhammetergenc/people-example.csv
Parsing completed. Parsed 7 lines in 0.047955 secs.
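
To make the column_type_hints line in the log a little more concrete, here is a rough sketch in plain Python, not GraphLab's own parser. It assumes a tiny made-up stand-in for people-example.csv and a hypothetical helper, parse_with_hints, that converts each field using the type listed for its column.

```python
import csv

# Made-up stand-in for people-example.csv; the real file's rows may differ.
SAMPLE_CSV = """First Name,Last Name,Country,age
Bob,Smith,United States,24
Felix,Brown,USA,23
"""

# One type per column, in the same order as the hints printed in the log.
COLUMN_TYPE_HINTS = [str, str, str, int]


def parse_with_hints(text, hints):
    """Hypothetical helper: convert every CSV field with its column's type."""
    reader = csv.reader(text.strip().splitlines())
    header = next(reader)
    rows = []
    for raw in reader:
        rows.append({name: cast(value)
                     for name, cast, value in zip(header, hints, raw)})
    return rows


people = parse_with_hints(SAMPLE_CSV, COLUMN_TYPE_HINTS)
print(people[0]["age"] + 1)  # age is an int, so arithmetic works
```

If the last hint were str instead of int, the age values would stay as text, which is the kind of mismatch the log message suggests correcting by hand.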

Now that we have loaded our data, let’s start with the basics.

SFrame Basics

# Type sf and press Shift + Enter in the IPython notebook

sf

Here is my output:

[Image: Python GraphLab Create on IPython notebooks]

The sf.head() function also fetches the first few rows of the data, and sf.tail() retrieves the last few rows. However, because our dataset has so few records, the output of all three (sf, sf.head() and sf.tail()) will be the same.
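
Why the three outputs match is easy to see with a plain Python list. This is only an illustration with list slicing, not GraphLab code. The seven stand-in rows and the window size of 10 are my assumptions.

```python
# Stand-in rows; only the number of rows matters for this illustration.
rows = ["row %d" % i for i in range(7)]


def head(data, n=10):
    """Return the first n items (assumed window size of 10)."""
    return data[:n]


def tail(data, n=10):
    """Return the last n items (assumed window size of 10)."""
    return data[-n:]


# With fewer rows than the window, both slices cover the whole list.
print(head(rows) == rows and tail(rows) == rows)

# With a smaller window the two ends finally differ.
print(head(rows, 3))
print(tail(rows, 3))
```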


GraphLab Canvas

GraphLab Canvas is a built-in visualization tool that comes with GraphLab Create.

# We can open any GraphLab data structure in GraphLab Canvas.
# We will use our sample data for the following examples.

sf.show()

The output will redirect you to the Canvas web application.

Here is my output:

[Screenshot of the Canvas output]

You can click on each column to see its most frequent items. In the Table view you can also browse your data in a clean, nicely structured layout. SFrames do not hold all of the data in memory; they are backed by disk, so you can view even a billion rows in GraphLab Canvas.

Here are some more simple operations:

# Set the target to the IPython notebook to view visualizations directly in your notebook.
graphlab.canvas.set_target('ipynb')

# View the age column's visualization in the notebook in categorical format.
sf['age'].show(view='Categorical')

# We can also calculate the mean value or the max value of the age column.
sf['age'].mean()
sf['age'].max()
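
For intuition about what these summaries compute, here is the same idea in plain Python on a made-up list of ages. The values are my own stand-ins, not the contents of people-example.csv, and the frequency count is only a loose analogue of what a categorical view displays.

```python
from collections import Counter

# Made-up ages used purely for illustration.
ages = [24, 23, 22, 23, 23, 22, 25]

# Categorical-style summary: how often each distinct value occurs.
counts = Counter(ages)
print(counts.most_common())

# Mean and max computed by hand.
mean_age = sum(ages) / float(len(ages))
max_age = max(ages)
print(mean_age)
print(max_age)
```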

Create new columns in our SFrame

sf['Full Name'] = sf['First Name'] + ' ' + sf['Last Name']

This code creates a new column that joins the First Name and Last Name columns, separated by a space.
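
The column arithmetic works element by element: row 1 of one column is combined with row 1 of the other, and so on. Here is a minimal sketch of that idea with ordinary Python lists and made-up names. It is not GraphLab code, just the equivalent row-wise logic.

```python
# Made-up stand-ins for the First Name and Last Name columns.
first_names = ["Bob", "Felix"]
last_names = ["Smith", "Brown"]

# Pair up the values row by row and join each pair with a space.
full_names = [first + " " + last
              for first, last in zip(first_names, last_names)]

print(full_names)
```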


You may have noticed that our Country column has United States in some rows and USA in others. We could write a function and call it in a for loop to fix each row, but GraphLab offers a cleaner and neater way to do this.

Advanced Transformation of our Data

Let’s write a function that will change ‘USA’ to ‘United States’.

def transform_country(country): 
    if country == 'USA':
        return 'United States'
    else:
        return country

Now, on the next line, if you try

transform_country('Turkey')
# The output is 'Turkey'

But if you try

transform_country('USA')
# The output is 'United States'

Let’s apply this to all the rows in our dataset.

sf['Country'] = sf['Country'].apply(transform_country)
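
Conceptually, apply calls the function once for every value in the column and collects the results into a new column. Here is a rough plain-Python picture of that, using a made-up list in place of the Country column. It repeats transform_country so the snippet runs on its own.

```python
def transform_country(country):
    """Map the abbreviation 'USA' to 'United States'; leave other values alone."""
    if country == 'USA':
        return 'United States'
    return country


# Made-up stand-in for the Country column.
countries = ['United States', 'USA', 'Canada']

# The explicit for loop mentioned earlier in the post...
cleaned_loop = []
for country in countries:
    cleaned_loop.append(transform_country(country))

# ...and the same thing written as one expression, closer in spirit to apply.
cleaned = [transform_country(country) for country in countries]

print(cleaned)
```

Both versions give the same result; the one-line form simply hides the loop.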

Now print out our data set again by typing sf.

[Image: GraphLab normalize data]

We have now cleaned our data and added a new column very easily using GraphLab!

 
