Fragile Families Challenge
What is the Fragile Families Challenge?
The Fragile Families Challenge (FFC) is a community challenge in which all participants share their methodology and code to achieve one common goal: understanding the struggles and improving the lives of disadvantaged children in the United States. You can find much more information about the challenge on its official website.
My participation in this important challenge started with an email that I received in April 2017. Matt Salganik from Princeton University was visiting our campus to give a talk about his research and the upcoming challenge that he was organizing with Ian Lundberg and Sara McLanahan. (I also highly recommend his book Bit by Bit: Social Research in the Digital Age :) ) He was also organizing a “getting started workshop” on our campus. This was one of the driving forces for me to participate in this challenge. Another interesting aspect of this challenge is the opportunity to combine social science and computational approaches to address the problem.
In his slides here, Matt gives a great introduction to the challenge, and he also highlights why we (social and computational scientists) should work on such problems together. I really liked his comparison of these two groups: social scientists and data scientists.
My approach to the problem
My approach to this problem is based on machine learning methodologies, where I employ learning models that can describe the underlying data. My expectation is to learn from these models which factors are important for predicting certain outcomes.
Data preprocessing
Before diving into the difficulties of this dataset, let me first describe what type of data we studied for this challenge. After completing the data-usage agreement and downloading the dataset, I had two main files: background.csv and train.csv.
The background file consists of 4,242 rows (one per child) and 12,943 columns. Columns in this data represent answers to survey questions. The survey design consists of 5 waves, from birth to age 9. The training data has 2,121 rows (one per child in the training set) and 6 outcomes measured at age 15. Our task is to use the available features among the 12,943 columns to predict the 6 outcomes for kids at age 15.
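For concreteness, here is a minimal sketch of loading the two files with pandas (I assume the training file carries an ID column alongside the six outcomes):

```python
import pandas as pd

# Load the two challenge files; low_memory=False avoids dtype warnings on
# the very wide background table.
background = pd.read_csv("background.csv", low_memory=False)
train = pd.read_csv("train.csv")

print(background.shape)  # (4242, 12943): one row per child
print(train.shape)       # (2121, 7): an ID column plus the 6 outcomes
```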
The first thing I noticed when I started to analyze this dataset is that it has a significant amount of missing values. These missing values are usually systematic, and one should take this into account.
My initial attempt was to discriminate between different types of features: features from different waves, about different parents, from different categories, etc. I implemented a couple of helper functions to select features from a given wave or, say, only the mother's answers.
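A helper along these lines can slice the columns by wave or respondent; this is a sketch that assumes the survey's variable-name prefixes (e.g., m for mother, cm for constructed mother variables), not the exact helpers from my repository:

```python
import re

def select_features(df, wave=None, prefix=None):
    """Select survey columns by wave and/or respondent prefix.

    Relies on the naming convention where a column starts with a respondent
    code followed by the wave number (e.g., m1... for the mother's wave-1
    answers, cm5... for constructed mother variables in wave 5).
    """
    cols = []
    for col in df.columns:
        match = re.match(r"^([a-z]+)([1-5])", col)
        if match is None:
            continue  # column does not follow the prefix convention
        resp, w = match.groups()
        if wave is not None and int(w) != wave:
            continue
        if prefix is not None and resp != prefix:
            continue
        cols.append(col)
    return df[cols]

# Examples: all wave-5 columns, or only the mother's wave-1 answers.
wave5 = select_features(background, wave=5)
mother_w1 = select_features(background, wave=1, prefix="m")
```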
After analyzing different categories of features, I decided to remove the features that are not available for all children. I could have worked on imputing the missing values, but since we have plenty of survey answers, I decided not to focus on that. However, there are some recommended approaches here.
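Filtering down to the complete columns can then look like this (a sketch; treating the survey's negative codes as missing is an assumption about the coding scheme):

```python
import numpy as np

# Keep only columns answered for every child. Both NaN and negative codes
# (e.g., -9 "refused", -6 "valid skip") are treated as missing here.
numeric = background.select_dtypes(include=[np.number])
is_missing = numeric.isna() | (numeric < 0)
complete = numeric.loc[:, ~is_missing.any()]
print(complete.shape)
```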
Building models
I built classification and regression models for the different outcomes. The outcomes “gpa”, “grit”, and “material hardship” take continuous values, so my choice for them was Random Forest regression; for the other three, binary outcomes (eviction, layoff, and job training) I fit Random Forest classifiers.
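In code, this split looks roughly like the following sketch, assuming the outcome column names from train.csv:

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Regressors for the continuous outcomes, classifiers for the binary ones;
# the outcome names mirror the train.csv columns.
CONTINUOUS = ["gpa", "grit", "materialHardship"]
BINARY = ["eviction", "layoff", "jobTraining"]

models = {outcome: RandomForestRegressor(n_estimators=100) for outcome in CONTINUOUS}
models.update({outcome: RandomForestClassifier(n_estimators=100) for outcome in BINARY})
```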
I tried different sets of features. I explored different waves, different parents, etc. My best-performing model for predicting GPA uses all the available features in wave 5 (3,236 features meet this criterion). The general model used in this analysis is sketched below.
```python
from sklearn.ensemble import RandomForestRegressor

# 100 trees; every other parameter is left at its scikit-learn default.
model = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                              max_features='auto', max_leaf_nodes=None,
                              min_impurity_split=1e-07, min_samples_leaf=1,
                              min_samples_split=2, min_weight_fraction_leaf=0.0,
                              n_estimators=100, n_jobs=1, oob_score=False,
                              random_state=None, verbose=0, warm_start=False)
```
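Fitting and predicting then follow the usual scikit-learn pattern; X_train, y_train, and X_heldout below are illustrative names for the wave-5 feature matrices and the GPA column:

```python
# Fit on all training children, then predict GPA for the heldout children.
model.fit(X_train, y_train)
gpa_pred = model.predict(X_heldout)
```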
Evaluation of submissions
For the evaluation I followed the organizers' performance metric: mean squared error (MSE). They describe their evaluation system here. Below is the general formula for how the different outcomes are evaluated.
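For each outcome, the score is the mean squared error between predicted and true values over the $N$ children in the heldout set:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$$

where $y_i$ is the true outcome for child $i$ and $\hat{y}_i$ is the prediction.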
I employed cross-validation to estimate the performance of each model on the training dataset. Later, I built a single model using all the available training data.
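Concretely, the estimate came from something like the following sketch, reusing the model and matrices from the earlier snippets:

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the training set. scikit-learn returns the
# negated MSE for "neg_mean_squared_error", hence the sign flip.
scores = cross_val_score(model, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=5)
print("estimated MSE: %.4f (+/- %.4f)" % (-scores.mean(), scores.std()))
```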
Between submissions I tracked the leaderboard score, my submission score computed on the heldout dataset, and my previous attempts. Please note that my own estimates are based on the training data, while the actual leaderboard results are evaluated on the heldout dataset. Below is my analysis of the different models that I submitted. I always kept the best model I had tested in the final submission.
Once the organizers announced the results, I analyzed my model to highlight the features it found most relevant. Below are the top-scoring features (a sketch of how to extract them from the fitted model follows the table).
Feature name | Description |
---|---|
hv5_wj10ss | Woodcock Johnson Test 10 standard score |
cm5hhinc | Constructed - Mother’s Household income |
hv5_ppvtss | PPVT standard score |
hv5_ppvtraw | PPVT raw score |
cf5hhincb | Constructed - Household income mother report for married/cohab if available |
t5c13c | c13C. Child’s mathematical skills |
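The ranking itself comes straight from the fitted forest; in this sketch, feature_names is assumed to hold the wave-5 column names:

```python
import pandas as pd

# Rank columns by the forest's impurity-based feature importances.
# `model` is the fitted GPA regressor from above.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```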
Notes and takeaways
After building some off-the-shelf methods to estimate the outcomes of kids in this dataset, I still do not feel that I gained much intuition about the causes of the problem. I am happy to learn more about other approaches at the workshop :) I hope the community effort will help build an ensemble of models that can point to some interesting insights about the problem.
My repository is also publicly available on GitHub: github.com/onurvarol/FragileFamiliesChallenge
My notes from the conference
I am going to share my notes from the FFC scientific workshop, which will take place November 16th and 17th (Thursday and Friday) at Princeton University. I am also going to give a talk at the workshop about the approach described in this blog post.
Details about the event are here.
Slides from my presentation are located here.