
Big Data and Machine Learning Hackathon using Azure ML at Microsoft

It was an awesome experience to participate in the Big Data and Machine Learning Hackathon using Azure ML, Power BI and other tools available on Microsoft Azure.

Our team won for implementing an end-to-end solution using the Microsoft Azure toolsets.

We used a Hadoop cluster on HDInsight, R programming, Azure ML Studio, Service Bus, Stream Analytics, Azure Websites, and finally Power BI. See below for more details about the solution.

About the team

It was a four-member team: Roy Budiantra and Timmy Liu (both drove all the way down from Vancouver to Seattle), Manish Gupta, and myself. We had met during a week-long data science boot camp and formed the team on the spot at the venue.


In the picture, from left to right: Roy Budiantra, Manish Gupta, Timmy Liu, Scott Klein (Microsoft Azure Evangelist), Joyjeet Dey Majumdar

Prior to this, we had spent more than 56 hours over a week at a hands-on data science boot camp organized by Raja Iqbal of Data Science Dojo. A lot of credit goes to him for preparing us as data scientists, which helped us identify a good model for the solution. He is a qualified data scientist, now an entrepreneur, who spent most of his career analyzing data on Microsoft Bing's ad relevance and data mining team.

Problem Statement

We were asked to pick data from the City of Seattle's data.seattle.gov website and use it to identify trends and predict behavior with the Azure toolsets. As a team, we decided to work on Seattle elementary school data to predict how external factors may affect a school's ranking. This can also help schools prepare for future needs, such as maintaining a healthy teacher-student ratio, focusing on areas that help students prosper, or even reducing costs if needed.

Architecture and Design

The image below illustrates the high-level design.

We used data from two sources:

  1. the City of Seattle website, i.e. data.seattle.gov, and
  2. 10 years' worth of school data for the state of Washington from the School Digger website.

First, we used Azure ML with an R script to clean the data from one of the sources, then loaded it into a Hive table running on the Hadoop cluster in HDInsight.
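For readers curious what that cleaning step might look like, here is a minimal R sketch. The column names and sample values are hypothetical, not from our actual dataset; it just illustrates the kind of tidying (trimming whitespace, normalizing school names, fixing types, dropping unusable rows) done before loading into Hive.

```r
# Hypothetical raw rows standing in for the school data.
raw <- data.frame(
  School = c("  example elementary ", "SAMPLE SCHOOL", NA),
  Rank   = c("12", "7", "3"),
  stringsAsFactors = FALSE
)

# Drop rows with no school name - they cannot be matched or ranked.
clean <- raw[!is.na(raw$School), ]

# Normalize names: strip stray whitespace, then use consistent title case.
clean$School <- tools::toTitleCase(tolower(trimws(clean$School)))

# Store ranks as integers rather than strings.
clean$Rank <- as.integer(clean$Rank)
```

Normalizing the name column early pays off later, when the two sources are merged on school name.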

Then we used Azure ML and an R script to merge and clean the data from the two disparate sources. Thanks to Neeraj Khanchandani, Principal Group Program Manager at Microsoft, for his help with the R script for the cleansing model below.

Here is the R script, for those who are interested, that we used to merge the data from the two disparate sources.

# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
dataset2 <- maml.mapInputPort(2) # class: data.frame

# Update the school-name column of dataset1 so the names
# match those used in dataset2.
for (i in seq_along(dataset1[, 3])) {
  # Find the first row in dataset2 whose name contains this school's name;
  # fixed = TRUE treats the name as a literal string, not a regex.
  x <- grep(dataset1[i, 3], dataset2[, 2], fixed = TRUE)
  if (length(x) > 0) {
    dataset1[i, 3] <- dataset2[x[1], 2]
  }
}

data.set <- data.frame(dataset1)

# Select the data.frame to be sent to the output Dataset port
maml.mapOutputPort("data.set");

Finally, we used Azure ML to train the model with 7 years of data and test it with the remaining 3 years, a 70:30 split. We tried various machine learning algorithms, such as Boosted Decision Tree Regression and Decision Forest Regression, along with an interim ranking model ("Rank Model Temp"), but finally settled on the "Ranker Final" model, as it provided the lowest error variation on the training data.
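The split-and-evaluate flow above can be sketched in plain R outside of Azure ML Studio. This is not our hackathon code: the features, the synthetic data, and the simple linear model are all placeholders, shown only to illustrate a 70:30 train/test split and held-out error measurement.

```r
set.seed(42)
n <- 100

# Synthetic stand-in for the school dataset: rank score driven by
# two made-up features plus noise.
schools <- data.frame(
  teacher_student_ratio = runif(n, 10, 30),
  spending_per_pupil    = runif(n, 5000, 15000)
)
schools$rank_score <- 50 - schools$teacher_student_ratio +
  schools$spending_per_pupil / 1000 + rnorm(n)

# 70:30 split into training and test sets.
train_idx <- sample(nrow(schools), size = 0.7 * nrow(schools))
train <- schools[train_idx, ]
test  <- schools[-train_idx, ]

# Fit a simple regression on the 70% and score the held-out 30%.
model <- lm(rank_score ~ teacher_student_ratio + spending_per_pupil,
            data = train)
pred  <- predict(model, newdata = test)

# Root-mean-squared error on the test set.
rmse <- sqrt(mean((test$rank_score - pred)^2))
```

In Azure ML Studio, the Split Data and Evaluate Model modules play the same roles as the manual split and RMSE calculation here.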

We then published it as an API to be consumed by external applications.
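For a sense of what consuming such an API involves: a published Azure ML Studio web service is called by POSTing a JSON body with "Inputs" and "GlobalParameters" to the service URL, with the API key in an Authorization header. The sketch below only builds that request body in base R; the URL, key, column names, and values are all placeholders, not our actual service.

```r
# Placeholders - a real published service provides its own URL and key.
api_url <- "https://<region>.services.azureml.net/.../execute?api-version=2.0"
api_key <- "YOUR_API_KEY"

# Hypothetical input row for the model.
school <- "Example Elementary"
ratio  <- 22.5

# Request body in the shape the classic Azure ML Studio
# request-response service expects.
body <- paste0(
  '{"Inputs":{"input1":{"ColumnNames":["school","teacher_student_ratio"],',
  '"Values":[["', school, '","', ratio, '"]]}},"GlobalParameters":{}}'
)

# A real client would POST `body` to api_url with the headers
# "Authorization: Bearer <api_key>" and "Content-Type: application/json",
# e.g. via httr::POST().
```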

We also published an Azure website that can be used to send live data that may affect a school's ranking. The events flow through Service Bus and a Stream Analytics job into Power BI, which shows how the incoming data affects the ranking. Below is an image of the ranking data as it appears on the Power BI dashboard.

All these were achieved in less than 24 hours.

As a Data Scientist, I can now take my knowledge of Big Data and Machine Learning and implement intelligent models and solutions at work that positively impact customers and information technology operations.