Data munging.

Data science means obtaining, scrubbing, exploring, modeling, and interpreting data, blending hacking, statistics, and machine learning.

Data wrangling/data jujitsu/data munging.

Three types of tasks:

1. Preparing to run a model (gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging).

2. Running the model.

3. Communicating the results.
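
A minimal sketch of the first task (preparing data before a model is run), using only the Python standard library. The CSV contents, column names, and function names are invented for illustration and are not from the course; the point is only to show gathering, cleaning, filtering, and merging as separate small steps.

```python
import csv
import io

# Hypothetical raw inputs, invented for illustration only.
RAW_USERS = """id,name,age
1, Alice ,34
2,Bob,
3,Carol,29
"""
RAW_ORDERS = """user_id,amount
1,20.5
3,5.0
3,7.5
"""

def read_rows(text):
    """Gather: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def clean_users(rows):
    """Clean and filter: trim whitespace, drop rows with a missing age."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        age = row["age"].strip()
        if not age:
            continue  # filter out incomplete records
        cleaned.append({"id": int(row["id"]), "name": name, "age": int(age)})
    return cleaned

def merge_totals(users, orders):
    """Merge: attach each user's total order amount (0.0 if no orders)."""
    totals = {}
    for order in orders:
        uid = int(order["user_id"])
        totals[uid] = totals.get(uid, 0.0) + float(order["amount"])
    return [dict(u, total=totals.get(u["id"], 0.0)) for u in users]

users = clean_users(read_rows(RAW_USERS))
prepared = merge_totals(users, read_rows(RAW_ORDERS))
for row in prepared:
    print(row)
```

In practice a library such as pandas would usually handle these steps, but the shape of the work (parse, clean, filter, combine) stays the same.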

Data products:

Data-driven apps (spellchecker, machine translator).

Interactive visualizations (Google flu application, Global Burden of Disease).

Online databases (enterprise data warehouse, Sloan Digital Sky Survey).



Distinguish Data Science from BI, statistics, data management, visualization, and machine learning.



What you need to learn:

1. How to deal with unstructured data.

2. How to work with data that does not fit in memory.

3. Statistical modeling and how to communicate results.

4. Algorithms and their trade-offs at scale.
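
One way to picture item 2: when data does not fit in memory, process it as a stream and keep only a small running summary. The sketch below is an assumption about how this might look in Python, not something from the course; `running_mean` and the file name in the comment are hypothetical.

```python
import io

def running_mean(lines):
    """Mean of one number per line, holding only two values in memory."""
    count = 0
    mean = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        count += 1
        mean += (float(line) - mean) / count  # incremental mean update
    return mean if count else None

# File objects are iterated lazily, so the same function works on a file
# far larger than RAM:
#   with open("measurements.txt") as f:   # hypothetical file name
#       print(running_mean(f))
sample = io.StringIO("2\n4\n6\n8\n")
result = running_mean(sample)
print(result)
```

The trade-off is that only statistics that can be updated incrementally (counts, sums, means) fit this pattern directly; something like an exact median needs a different approach.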


tools          <->   abstraction (implementation)
Hadoop               MapReduce
PostgreSQL           Relational Algebra
glm(..) in R         Logistic Regression
Tableau              InfoVis
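
To make the tool versus abstraction distinction concrete, here is a single-process sketch of the MapReduce abstraction as a word count in plain Python. It does not use Hadoop or any Hadoop API; the function names are hypothetical, and a real framework would run the map and reduce steps across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["data science", "Data munging", "science of data"])
print(counts)
```

The abstraction is the pair of functions (map and reduce) plus the grouping step between them; Hadoop is one tool that implements it at scale.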


structures           <->   statistics
Management                 Analysis
Relational Algebra         Linear Algebra
Standards                  ad hoc files
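
As a rough illustration of the relational algebra side, the three core operators (select, project, join) can be sketched over lists of dicts in Python. The function names and the small tables are assumptions made for this example, and the projection here does not remove duplicate rows as true relational algebra would.

```python
def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def project(rows, columns):
    """Projection: keep only the named columns (duplicates are not removed)."""
    return [{c: r[c] for c in columns} for r in rows]

def join(left, right, key):
    """Natural join on one shared column, using a hash index on the right side."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Small illustrative tables built from pairs mentioned in these notes.
abstractions = [
    {"tool": "Hadoop", "abstraction": "MapReduce"},
    {"tool": "PostgreSQL", "abstraction": "Relational Algebra"},
    {"tool": "Tableau", "abstraction": "InfoVis"},
]
settings = [
    {"tool": "Hadoop", "setting": "cloud"},
    {"tool": "R", "setting": "desktop"},
]

joined = join(abstractions, settings, "tool")
result = project(select(joined, lambda r: r["setting"] == "cloud"),
                 ["tool", "abstraction"])
print(result)
```

A database such as PostgreSQL exposes the same operators through SQL and chooses how to execute them; the sketch only shows what each operator means.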


desktop          <->   cloud
main memory            distributed
R                      Hadoop
local files            S3, Azure storage


hackers                                    <->   analysts
assume proficiency in Python, Java, R            assume little programming


eScience = Data Science


