### DATA SCIENCE OVERVIEW

**Overview/Description**

Data science differentiates itself from academic statistics and application programming by using what it needs from a variety of disciplines. In this course, you’ll explore what it is to be a data scientist and study what sets data science apart from other disciplines. It prepares learners to navigate the foundational elements of data science.

**Target Audience**

Individuals with some programming and math experience working toward implementing data science in their everyday work

**Data Science Overview**

- start the course
- define data science and what it is to be a data scientist
- describe the data wrangling aspect of data science
- describe the big data aspect of data science
- describe the machine learning aspect of data science
- use common data science terminology
- recognize ways to communicate results of your data science
- recall the steps in data science analysis
- compare various tools and software libraries used for data science

### DATA GATHERING

**Overview/Description**

To carry out data science, you need to gather data. Extracting, parsing, and scraping data from various sources, both internal and external, is a critical first part in the data science pipeline. In this course, you’ll explore examples of practical tools for data gathering.

**Target Audience**

Individuals with some programming and math experience working toward implementing data science in their everyday work

**Data Gathering**

- start the course
- describe problems and software tools associated with data gathering
- use curl to gather data from the Web
- use in2csv to convert spreadsheet data to CSV format
- use agate to extract data from spreadsheets
- use agate to extract tabular data from dbf files
- extract data from particular tags in an HTML document
- distinguish between metadata and data
- work with metadata in HTTP headers
- work with Linux log files
- work with metadata in email headers
- perform a secure shell connection to a remote server
- copy remote data using a secure copy
- synchronize data from a remote server
- download an HTML file and explore table data
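
As an illustration of the objective on exploring table data in an HTML file, the following is a minimal sketch that uses only Python's standard-library `html.parser`. The class name `TableCellParser` and the sample markup are hypothetical and are not taken from the course materials; real pages usually need more robust handling.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of <td>/<th> cells, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # row currently being built, or None
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self._row is not None:
            self._row[-1] += data.strip()


# Hypothetical sample document standing in for a downloaded HTML file.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>name</th><th>qty</th></tr>"
    "<tr><td>apple</td><td>3</td></tr>"
    "</table>"
)

parser = TableCellParser()
parser.feed(SAMPLE_HTML)
for row in parser.rows:
    print(row)
```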

### DATA FILTERING

**Overview/Description**

Once data is gathered for data science, it is often in an unstructured or raw format and must be filtered for content and validity. In this course, you’ll explore examples of practical tools and techniques for data filtering.

**Target Audience**

Individuals with some programming and math experience working toward implementing data science in their everyday work

**Data Filtering**

- start the course
- identify common filtering techniques and tools
- extract date elements from common date formats
- parse content types in HTTP headers
- use csvcut to filter CSV data
- use sed to replace values in a text data stream
- drop duplicate records from data
- extract headers from a JPEG image
- use pdfgrep to extract data from searchable PDF files
- detect invalid or impossible data combinations
- parse robots.txt from a web site to decide what should and shouldn’t be crawled or indexed
- drop records from a CSV file based on date range
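
Two of the objectives above, dropping duplicate records and dropping records by date range, can be sketched with Python's standard library. This is a minimal sketch under stated assumptions: the function name `filter_rows`, the column name `date`, and the sample data are hypothetical, and dates are assumed to already be in ISO 8601 form.

```python
import csv
import io
from datetime import date


def filter_rows(csv_text, start, end, date_field="date"):
    """Drop exact duplicate rows, then keep rows dated within [start, end]."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    kept = []
    for row in reader:
        key = tuple(row.items())  # the whole row identifies a duplicate
        if key in seen:
            continue
        seen.add(key)
        day = date.fromisoformat(row[date_field])
        if start <= day <= end:
            kept.append(row)
    return kept


# Hypothetical sample data with one repeated record.
SAMPLE = """id,date,value
1,2016-01-05,10
1,2016-01-05,10
2,2016-02-10,20
3,2016-03-15,30
"""

rows = filter_rows(SAMPLE, date(2016, 1, 1), date(2016, 2, 28))
for row in rows:
    print(row["id"], row["date"], row["value"])
```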

### DATA TRANSFORMATION

**Overview/Description**

Once data is filtered, the next step is to transform it into a usable format. In this course, you’ll explore examples of practical tools and techniques for data transformation.

**Target Audience**

Individuals with some programming and math experience working toward implementing data science in their everyday work

**Data Transformation**

- start the course
- convert CSV data to JSON format
- convert XML data to JSON format
- create SQL inserts from CSV data
- extract CSV data from SQL
- change delimiters in a CSV file from commas to tabs
- convert basic date formats to standard ISO 8601 format
- convert numeric formats within a CSV document
- round floating point decimals to two places within a CSV document
- use optical character recognition (OCR) to extract text from a JPEG image
- use optical character recognition (OCR) to extract text from a PDF document
- read various date formats and convert them to standards-compliant ISO 8601 format
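
As a rough illustration of the CSV-to-JSON and ISO 8601 objectives above, the sketch below uses only the Python standard library. The helper names, the column name `date`, and the list of accepted input formats are assumptions made for the example; in particular, `03/04/2016` is read as month/day/year here, which may not suit every data source.

```python
import csv
import io
import json
from datetime import datetime

# Assumed input formats, tried in order; extend as the data requires.
DATE_FORMATS = ("%m/%d/%Y", "%d.%m.%Y", "%Y-%m-%d")


def to_iso_date(text, formats=DATE_FORMATS):
    """Return the date in ISO 8601 form (YYYY-MM-DD), or raise ValueError."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def csv_to_json(csv_text, date_field="date"):
    """Convert CSV text to a JSON array, normalizing one date column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row[date_field] = to_iso_date(row[date_field])
    return json.dumps(rows, indent=2)


# Hypothetical sample data mixing two date styles.
SAMPLE = "name,date\nalpha,03/15/2016\nbeta,15.03.2016\n"
print(csv_to_json(SAMPLE))
```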

### DATA EXPLORATION

**Overview/Description**

Once data is transformed into a usable format, the next step is to carry out preliminary exploration of it. In this course, you’ll explore examples of practical tools and techniques for data exploration.

**Target Audience**

Individuals with some programming and math experience working toward implementing data science in their everyday work

**Data Exploration**

- start the course
- use csvgrep to explore CSV data
- use csvstat to explore values in CSV data
- use csvsql to query CSV data like a SQL database
- use gnuplot to quickly plot data on the command line
- use wc to count words, characters, and lines within a text file
- explore a subdirectory tree from the command line
- use natural language processing to count word frequencies in a text document
- take random samples from a list of records
- find the top rows by value and percent in a data set
- find repeated records in a data set
- identify outliers using standard deviation
- perform a word frequency count on a classic book from Project Gutenberg
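
The objective on identifying outliers using standard deviation can be sketched as follows. This is one common convention rather than the course's own method: a value is flagged when it lies more than `threshold` sample standard deviations from the mean. The function name, the default threshold of 2.0, and the sample values are assumptions made for the example.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` sample std devs from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical measurements with one obviously extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]
print(find_outliers(readings))
```

Note that an extreme value also inflates the mean and standard deviation it is compared against, so this simple rule can miss outliers in small samples.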

### DATA INTEGRATION

**Overview/Description**

Data integration is the last step in the data wrangling process, where data is put into a usable, structured format for analysis. In this course, you’ll explore examples of practical tools and techniques for data integration.

**Target Audience**

Individuals with some programming and math experience working toward implementing data science in their everyday work

**Data Integration**

- start the course
- use csvjoin to concatenate CSV data
- use the cat command to concatenate separate logs into a single file
- sort lines in a text file
- merge separate XML files into a single schema
- aggregate data from a CSV file into a table of summarized values
- normalize data from unstructured sources
- denormalize data from a structured source
- use pivot tables to cross tabulate data
- insert missing values in a data set
- use csvjoin to merge two compatible CSV documents into one
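
To illustrate what merging two compatible CSV documents on a shared column involves, here is a minimal pure-Python sketch of an inner join. It is a stand-in for the idea only and does not reproduce how csvjoin is implemented; the function name `inner_join`, the key column `id`, and the sample data are hypothetical.

```python
import csv
import io


def inner_join(left_text, right_text, key):
    """Join two CSV texts on `key`, keeping only rows present in both."""
    left = list(csv.DictReader(io.StringIO(left_text)))
    right = list(csv.DictReader(io.StringIO(right_text)))

    # Index the right-hand rows by key so each lookup is a dict access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            merged = dict(lrow)
            merged.update({k: v for k, v in rrow.items() if k != key})
            joined.append(merged)
    return joined


# Hypothetical sample documents sharing an "id" column.
PEOPLE = "id,name\n1,Ann\n2,Bob\n"
CITIES = "id,city\n1,Oslo\n3,Rome\n"

for row in inner_join(PEOPLE, CITIES, "id"):
    print(row)
```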

### DATA ANALYSIS CONCEPTS

**Overview/Description**

There are many software and programming tools available to data scientists. Before applying those tools effectively, you must understand the underlying concepts. In this course, you’ll explore the underlying data analysis concepts needed to employ the software and programming tools effectively.

**Target Audience**

Individuals with some programming and math experience working toward implementing data science in their everyday work

**Data Analysis Concepts**

- start the course
- perform basic math operations required by data scientists
- perform basic vector math operations required by data scientists
- perform basic matrix math operations required by data scientists
- perform a matrix decomposition
- identify different forms of data
- describe probability in terms of events and sample space size
- describe basic properties of outcomes
- apply probability rules in calculation
- identify common continuous probability distributions
- identify common discrete probability distributions
- apply Bayes’ theorem and describe how it is used in email spam algorithms
- apply random sampling to A/B tests
- identify and describe various statistical measures
- describe the difference between an unbiased and biased estimator
- describe sampling distributions and recognize the central limit theorem
- define confidence intervals and work with margins of error
- carry out hypothesis tests and work with p-values
- apply the chi-square test for categorical values
- identify the types of data sets from the given descriptions
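
The Bayes' theorem objective above can be made concrete with a single-word spam example: P(spam | word) = P(word | spam) P(spam) / P(word), where P(word) sums over the spam and non-spam cases. The sketch below is illustrative only; the function name and all probabilities are made-up assumptions, and real spam filters combine evidence from many words.

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_ham):
    """Return P(spam | word) from a prior and two likelihoods."""
    p_ham = 1.0 - p_spam
    # Total probability of seeing the word in any message.
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / evidence


# Made-up numbers: 20% of mail is spam; the word appears in 50% of spam
# messages and in 5% of legitimate ("ham") messages.
probability = posterior_spam(0.2, 0.5, 0.05)
print(f"P(spam | word) = {probability:.3f}")
```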

### DATA CLASSIFICATION AND MACHINE LEARNING

**Overview/Description**

Machine learning is a particular area of data science that uses techniques to create models from data without being explicitly programmed. In this course, you’ll explore the conceptual elements of various machine learning techniques.

**Target Audience**

Individuals with some programming and math experience working toward implementing data science in their everyday work

**Data Classification and Machine Learning**

- start the course
- identify problems in which supervised learning techniques apply
- identify problems in which unsupervised learning techniques apply
- apply linear regression to machine learning problems
- identify predictors in machine learning
- apply logistic regression to machine learning problems
- describe the use of dummy variables
- use naive Bayes classification techniques
- work with decision trees
- describe K-means clustering
- define cluster validation
- define principal component analysis
- describe machine learning errors
- describe underfitting
- describe overfitting
- apply k-fold cross-validation
- describe feed-forward and back-propagation in neural networks
- describe SVMs and their use
- choose the appropriate machine learning method for the given example problems
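
As a sketch of the k-fold cross-validation objective above, the generator below splits sample indices into k folds and yields each fold once as the test set, with the remaining indices as the training set. The name `k_fold_indices` is hypothetical and the split is unshuffled for clarity; in practice data is usually shuffled first, and libraries offer ready-made equivalents.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

A model would be fit on each training split and scored on the matching test split, and the k scores averaged to estimate how well it generalizes.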

### DATA COMMUNICATION AND VISUALIZATION

**Overview/Description**

The final step in the data science pipeline is to communicate the results or findings. In this course, you’ll explore communication and visualization concepts needed by data scientists.

**Target Audience**

Individuals with some programming and math experience working toward implementing data science in their everyday work

**Data Communication and Visualization**

- start the course
- choose appropriate visualization techniques
- describe the difference between correlation and causation
- define Simpson’s paradox
- communicate data science results informally
- communicate data science results formally
- implement strategies for effective data communication
- use scatter plots
- use line graphs
- use bar charts
- use histograms
- use box plots
- create a network visualization
- create a bubble plot
- create an interactive plot
- find a data set that is well represented by a scatter plot and plot it