datama 
Data Mining and Analysis 
28 hours 
Objective:
Delegates will be able to analyse big data sets, extract patterns, and choose the variables that most affect the results, so that a new model can be built to produce predictions.
Data preprocessing
Data Cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Statistical inference
Probability distributions, Random variables, Central limit theorem
Sampling
Confidence intervals
Statistical Inference
Hypothesis testing
Multivariate linear regression
Specification
Subset selection
Estimation
Validation
Prediction
Classification methods
Logistic regression
Linear discriminant analysis
K-nearest neighbours
Naive Bayes
Comparison of Classification methods
Neural Networks
Fitting neural networks
Issues in training neural networks
Decision trees
Regression trees
Classification trees
Trees Versus Linear Models
Bagging, Random Forests, Boosting
Bagging
Random Forests
Boosting
Support Vector Machines and Flexible Discriminants
Maximal Margin classifier
Support vector classifiers
Support vector machines
SVMs with two or more classes
Relationship to logistic regression
Principal Components Analysis
Clustering
K-means clustering
K-medoids clustering
Hierarchical clustering
Density-based clustering
Model Assessment and Selection
Bias, Variance and Model complexity
In-sample prediction error
The Bayesian approach
Cross-validation
Bootstrap methods
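As an illustration of the cross-validation topic listed above, the following is a minimal sketch of how k-fold cross-validation partitions a data set. The function name `k_fold_indices` is hypothetical and not part of the course material; the course itself may use a library routine instead.

```python
def k_fold_indices(n_samples, k):
    """Split the indices 0..n_samples-1 into k (train, test) pairs.

    Each index appears in exactly one test fold; the first
    n_samples % k folds receive one extra element.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


# Example: 10 observations, 3 folds (illustrative numbers only).
for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

A model would be fitted on each `train` set and scored on the matching `test` set, and the scores averaged to estimate the out-of-sample error.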

rintrob 
Introductory R for Biologists 
28 hours 
I. Introduction and preliminaries
1. Overview
Making R more friendly, R and available GUIs
RStudio
Related software and documentation
R and statistics
Using R interactively
An introductory session
Getting help with functions and features
R commands, case sensitivity, etc.
Recall and correction of previous commands
Executing commands from or diverting output to a file
Data permanency and removing objects
Good programming practice: self-contained scripts, good readability (e.g. structured scripts, documentation, markdown)
Installing packages; CRAN and Bioconductor
2. Reading data
Txt files (read.delim)
CSV files
3. Simple manipulations; numbers and vectors + arrays
Vectors and assignment
Vector arithmetic
Generating regular sequences
Logical vectors
Missing values
Character vectors
Index vectors; selecting and modifying subsets of a data set
Arrays
Array indexing. Subsections of an array
Index matrices
The array() function + simple operations on arrays e.g. multiplication, transposition
Other types of objects
4. Lists and data frames
Lists
Constructing and modifying lists
Concatenating lists
Data frames
Making data frames
Working with data frames
Attaching arbitrary lists
Managing the search path
5. Data manipulation
Selecting, subsetting observations and variables
Filtering, grouping
Recoding, transformations
Aggregation, combining data sets
Forming partitioned matrices, cbind() and rbind()
The concatenation function, c(), with arrays
Character manipulation, stringr package
Short introduction to grep and regexpr
6. More on Reading data
XLS, XLSX files
readr and readxl packages
SPSS, SAS, Stata, … and other data formats
Exporting data to txt, csv and other formats
7. Grouping, loops and conditional execution
Grouped expressions
Control statements
Conditional execution: if statements
Repetitive execution: for loops, repeat and while
Introduction to apply, lapply, sapply, tapply
8. Functions
Creating functions
Optional arguments and default values
Variable number of arguments
Scope and its consequences
9. Simple graphics in R
Creating a Graph
Density Plots
Dot Plots
Bar Plots
Line Charts
Pie Charts
Boxplots
Scatter Plots
Combining Plots
II. Statistical analysis in R
1. Probability distributions
R as a set of statistical tables
Examining the distribution of a set of data
2. Testing of Hypotheses
Tests about a Population Mean
Likelihood Ratio Test
One- and two-sample tests
Chi-Square Goodness-of-Fit Test
Kolmogorov-Smirnov One-Sample Statistic
Wilcoxon Signed-Rank Test
Two-Sample Test
Wilcoxon Rank Sum Test
Mann-Whitney Test
Kolmogorov-Smirnov Test
3. Multiple Testing of Hypotheses
Type I Error and FDR
ROC curves and AUC
Multiple Testing Procedures (BH, Bonferroni etc.)
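The two named procedures can be sketched as follows. This is a minimal illustration in Python rather than the R functions the course uses; the function names and the p-values are hypothetical. Bonferroni rejects a hypothesis when its p-value is at most alpha/m, while Benjamini-Hochberg (BH) rejects the k smallest p-values, where k is the largest rank with p_(k) <= (k/m) * alpha.

```python
def bonferroni(p_values, alpha=0.05):
    """Controls the family-wise error rate: reject H_i if p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Controls the false discovery rate with the BH step-up rule."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank  # remember the largest rank that passes
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


# Hypothetical p-values from eight tests.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print("Bonferroni rejections:", sum(bonferroni(p)))
print("BH rejections:", sum(benjamini_hochberg(p)))
```

On the same input BH rejects at least as many hypotheses as Bonferroni, which is the usual trade-off between controlling the FDR and controlling the Type I error across the whole family.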
4. Linear regression models
Generic functions for extracting model information
Updating fitted models
Generalized linear models
Families
The glm() function
Classification
Logistic Regression
Linear Discriminant Analysis
Unsupervised learning
Principal Components Analysis
Clustering Methods (k-means, hierarchical clustering, k-medoids)
5. Survival analysis (survival package)
Survival objects in R
Kaplan-Meier estimate, log-rank test, parametric regression
Confidence bands
Censored (interval censored) data analysis
Cox PH models, constant covariates
Cox PH models, time-dependent covariates
Simulation: Model comparison (Comparing regression models)
6. Analysis of Variance
One-Way ANOVA
Two-Way Classification of ANOVA
MANOVA
III. Worked problems in bioinformatics
Short introduction to limma package
Microarray data analysis workflow
Data download from GEO: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1397
Data processing (QC, normalisation, differential expression)
Volcano plot
Clustering examples + heatmaps

dmmlr 
Data Mining & Machine Learning with R 
14 hours 
Introduction to Data mining and Machine Learning
Statistical learning vs. Machine learning
Iteration and evaluation
Bias-Variance tradeoff
Regression
Linear regression
Generalizations and Nonlinearity
Exercises
Classification
Bayesian refresher
Naive Bayes
Discriminant analysis
Logistic regression
K-Nearest neighbors
Support Vector Machines
Neural networks
Decision trees
Exercises
Cross-validation and Resampling
Cross-validation approaches
Bootstrap
Exercises
Unsupervised Learning
K-means clustering
Examples
Challenges of unsupervised learning and beyond K-means
Advanced topics
Ensemble models
Mixed models
Boosting
Examples
Multidimensional reduction
Factor Analysis
Principal Component Analysis
Examples

rprogda 
R Programming for Data Analysis 
14 hours 
This course is part of the Data Scientist skill set (Domain: Data and Technology)
Introduction and preliminaries
Making R more friendly, R and available GUIs
RStudio
Related software and documentation
R and statistics
Using R interactively
An introductory session
Getting help with functions and features
R commands, case sensitivity, etc.
Recall and correction of previous commands
Executing commands from or diverting output to a file
Data permanency and removing objects
Simple manipulations; numbers and vectors
Vectors and assignment
Vector arithmetic
Generating regular sequences
Logical vectors
Missing values
Character vectors
Index vectors; selecting and modifying subsets of a data set
Other types of objects
Objects, their modes and attributes
Intrinsic attributes: mode and length
Changing the length of an object
Getting and setting attributes
The class of an object
Arrays and matrices
Arrays
Array indexing. Subsections of an array
Index matrices
The array() function
The outer product of two arrays
Generalized transpose of an array
Matrix facilities
Matrix multiplication
Linear equations and inversion
Eigenvalues and eigenvectors
Singular value decomposition and determinants
Least squares fitting and the QR decomposition
Forming partitioned matrices, cbind() and rbind()
The concatenation function, c(), with arrays
Frequency tables from factors
Lists and data frames
Lists
Constructing and modifying lists
Concatenating lists
Data frames
Making data frames
attach() and detach()
Working with data frames
Attaching arbitrary lists
Managing the search path
Data manipulation
Selecting, subsetting observations and variables
Filtering, grouping
Recoding, transformations
Aggregation, combining data sets
Character manipulation, stringr package
Reading data
Txt files
CSV files
XLS, XLSX files
SPSS, SAS, Stata, … and other data formats
Exporting data to txt, csv and other formats
Accessing data from databases using SQL language
Probability distributions
R as a set of statistical tables
Examining the distribution of a set of data
One- and two-sample tests
Grouping, loops and conditional execution
Grouped expressions
Control statements
Conditional execution: if statements
Repetitive execution: for loops, repeat and while
Writing your own functions
Simple examples
Defining new binary operators
Named arguments and defaults
The '...' argument
Assignments within functions
More advanced examples
Efficiency factors in block designs
Dropping all names in a printed array
Recursive numerical integration
Scope
Customizing the environment
Classes, generic functions and object orientation
Graphical procedures
High-level plotting commands
The plot() function
Displaying multivariate data
Display graphics
Arguments to high-level plotting functions
Basic visualisation graphs
Multivariate relations with the lattice and ggplot packages
Using graphics parameters
Graphics parameters list
Automated and interactive reporting
Combining output from R with text

mdlmrah 
Model MapReduce and Apache Hadoop 
14 hours 
The course is intended for IT specialists who work with the distributed processing of large data sets across clusters of computers.
Data Mining and Business Intelligence
Introduction
Area of application
Capabilities
Basics of data exploration
Big data
What does Big data stand for?
Big data and Data mining
MapReduce
Model basics
Example application
Stats
Cluster model
Hadoop
What is Hadoop
Installation
Configuration
Cluster settings
Architecture and configuration of Hadoop Distributed File System
Console tools
DistCp tool
MapReduce and Hadoop
Streaming
Administration and configuration of Hadoop On Demand
Alternatives

bigddbsysfun 
Big Data & Database Systems Fundamentals 
14 hours 
The course is part of the Data Scientist skill set (Domain: Data and Technology).
Data Warehousing Concepts
What is a Data Warehouse?
Difference between OLTP and Data Warehousing
Data Acquisition
Data Extraction
Data Transformation
Data Loading
Data Marts
Dependent vs Independent Data Marts
Database design
ETL Testing Concepts:
Introduction
Software development life cycle
Testing methodologies
ETL Testing Work Flow Process
ETL Testing Responsibilities in DataStage
Big Data Fundamentals
Big Data and its role in the corporate world
The phases of development of a Big Data strategy within a corporation
Explain the rationale underlying a holistic approach to Big Data
Components needed in a Big Data Platform
Big data storage solution
Limits of Traditional Technologies
Overview of database types
NoSQL Databases
Hadoop
Map Reduce
Apache Spark

sspsspas
Statistics with SPSS Predictive Analytics Software 
14 hours 
Goal:
Learn to work with SPSS independently
Audience:
Analysts, researchers, scientists, students and anyone who wants to learn to use the SPSS package and popular data mining techniques.
Using the program
The dialog boxes
entering and importing data
the concept of a variable and measurement scales
preparing a database
Generate tables and graphs
formatting of the report
Command language syntax
automated analysis
storing and modifying procedures
creating your own analytical procedures
Data Analysis
descriptive statistics
Key terms: e.g. variable, hypothesis, statistical significance
measures of central tendency
measures of dispersion
standardization
Introduction to researching relationships between variables
correlational and experimental methods
Summary: case study and discussion

datavis1 
Data Visualization 
28 hours 
This course is intended for engineers and decision makers working in data mining and knowledge discovery.
You will learn how to create effective plots and ways to present and represent your data in a way that will appeal to the decision makers and help them to understand hidden information.
Day 1:
what is data visualization
why it is important
data visualization vs data mining
human cognition
HMI
common pitfalls
Day 2:
different types of curves
drill-down curves
categorical data plotting
multi-variable plots
data glyph and icon representation
Day 3:
plotting KPIs with data
R and X chart examples
what-if dashboards
parallel axes mixing
categorical data with numeric data
Day 4:
different hats of data visualization
how can data visualization lie
disguised and hidden trends
a case study of student data
visual queries and region selection

bdbiga 
Big Data Business Intelligence for Govt. Agencies 
35 hours 
Advances in technologies and the increasing amount of information are transforming how business is conducted in many industries, including government. Government data generation and digital archiving rates are on the rise due to the rapid growth of mobile devices and applications, smart sensors and devices, cloud computing solutions, and citizen-facing portals. As digital information expands and becomes more complex, information management, processing, storage, security, and disposition become more complex as well. New capture, search, discovery, and analysis tools are helping organizations gain insights from their unstructured data. The government market is at a tipping point, realizing that information is a strategic asset, and government needs to protect, leverage, and analyze both structured and unstructured information to better serve and meet mission requirements. As government leaders strive to evolve data-driven organizations to successfully accomplish their missions, they are laying the groundwork to correlate dependencies across events, people, processes, and information.
High-value government solutions will be created from a mashup of the most disruptive technologies:
Mobile devices and applications
Cloud services
Social business technologies and networking
Big Data and analytics
IDC predicts that by 2020, the IT industry will reach $5 trillion, approximately $1.7 trillion larger than today, and that 80% of the industry's growth will be driven by these 3rd Platform technologies. In the long term, these technologies will be key tools for dealing with the complexity of increased digital information. Big Data is one of the intelligent industry solutions and allows government to make better decisions by taking action based on patterns revealed by analyzing large volumes of data — related and unrelated, structured and unstructured.
But accomplishing these feats takes far more than simply accumulating massive quantities of data. “Making sense of these volumes of Big Data requires cutting-edge tools and technologies that can analyze and extract useful knowledge from vast and diverse streams of information,” Tom Kalil and Fen Zhao of the White House Office of Science and Technology Policy wrote in a post on the OSTP Blog.
The White House took a step toward helping agencies find these technologies when it established the National Big Data Research and Development Initiative in 2012. The initiative included more than $200 million to make the most of the explosion of Big Data and the tools needed to analyze it.
The challenges that Big Data poses are nearly as daunting as its promise is encouraging. Storing data efficiently is one of these challenges. As always, budgets are tight, so agencies must minimize the per-megabyte price of storage and keep the data within easy access so that users can get it when they want it and how they need it. Backing up massive quantities of data heightens the challenge.
Analyzing the data effectively is another major challenge. Many agencies employ commercial tools that enable them to sift through the mountains of data, spotting trends that can help them operate more efficiently. (A recent study by MeriTalk found that federal IT executives think Big Data could help agencies save more than $500 billion while also fulfilling mission objectives.)
Custom-developed Big Data tools also are allowing agencies to address the need to analyze their data. For example, the Oak Ridge National Laboratory’s Computational Data Analytics Group has made its Piranha data analytics system available to other agencies. The system has helped medical researchers find a link that can alert doctors to aortic aneurysms before they strike. It’s also used for more mundane tasks, such as sifting through résumés to connect job candidates with hiring managers.
Each session is 2 hours
Day 1: Session 1: Business Overview of Why Big Data Business Intelligence in Govt.
Case Studies from NIH, DoE
Big Data adoption rate in Govt. agencies and how they are aligning their future operations around Big Data Predictive Analytics
Broad Scale Application Area in DoD, NSA, IRS, USDA etc.
Interfacing Big Data with Legacy data
Basic understanding of enabling technologies in predictive analytics
Data Integration & Dashboard visualization
Fraud management
Business Rule/ Fraud detection generation
Threat detection and profiling
Cost benefit analysis for Big Data implementation
Day 1: Session 2: Introduction to Big Data-1
Main characteristics of Big Data - volume, variety, velocity and veracity. MPP architecture for volume.
Data Warehouses – static schema, slowly evolving dataset
MPP Databases like Greenplum, Exadata, Teradata, Netezza, Vertica etc.
Hadoop Based Solutions – no conditions on structure of dataset.
Typical pattern : HDFS, MapReduce (crunch), retrieve from HDFS
Batch - suited for analytical/non-interactive workloads
Velocity: CEP streaming data
Typical choices – CEP products (e.g. Infostreams, Apama, MarkLogic etc)
Less production ready – Storm/S4
NoSQL Databases – (columnar and key-value): Best suited as analytical adjunct to data warehouse/database
Day 1: Session 3: Introduction to Big Data-2
NoSQL solutions
KV Store - Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
KV Store - Dynamo, Voldemort, Dynomite, SubRecord, MotionDb, DovetailDB
KV Store (Hierarchical) - GT.M, Cache
KV Store (Ordered) - TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
KV Cache - Memcached, Repcached, Coherence, Infinispan, EXtremeScale, JBossCache, Velocity, Terracotta
Tuple Store - Gigaspaces, Coord, Apache River
Object Database - ZopeDB, db4o, Shoal
Document Store - CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
Wide Columnar Store - BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI
Varieties of Data: Introduction to data cleaning issues in Big Data
RDBMS – static structure/schema, does not promote an agile, exploratory environment.
NoSQL – semi-structured; enough structure to store data without defining an exact schema first
Data cleaning issues
Day 1: Session 4: Big Data Introduction-3: Hadoop
When to select Hadoop?
STRUCTURED - Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not good for active exploration)
SEMI-STRUCTURED data – tough to do with traditional solutions (DW/DB)
Warehousing data = HUGE effort and static even after implementation
For variety & volume of data, crunched on commodity hardware – HADOOP
Commodity H/W needed to create a Hadoop Cluster
Introduction to MapReduce/HDFS
MapReduce – distribute computing over multiple servers
HDFS – make data available locally for the computing process (with redundancy)
Data – can be unstructured/schemaless (unlike RDBMS)
Developer responsibility to make sense of data
Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS
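The MapReduce model described above can be sketched on a single machine as follows. This is an illustrative sketch in plain Python, not Hadoop's Java API: the function names are hypothetical, and a real cluster would run the map and reduce steps on many servers with the input read from HDFS.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data on hadoop", "big cluster"]))
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be distributed across servers without coordination between workers.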
Day 2: Session 1: Big Data Ecosystem - Building Big Data ETL: the universe of Big Data tools - which one to use and when?
Hadoop vs. Other NoSQL solutions
For interactive, random access to data
HBase (column oriented database) on top of Hadoop
Random access to data but restrictions imposed (max 1 PB)
Not good for ad-hoc analytics, good for logging, counting, time-series
Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access)
Flume – Stream data (e.g. log data) into HDFS
Day 2: Session 2: Big Data Management System
Moving parts, compute nodes start/fail: ZooKeeper - for configuration/coordination/naming services
Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain
Deploy, configure, cluster management, upgrade etc (sys admin): Ambari
In the cloud: Whirr
Day 2: Session 3: Predictive analytics in Business Intelligence-1: Fundamental techniques & machine learning based BI:
Introduction to Machine learning
Learning classification techniques
Bayesian Prediction - preparing training file
Support Vector Machine
KNN p-Tree Algebra & vertical mining
Neural Network
Big Data large variable problem - Random forest (RF)
Big Data Automation problem – Multi-model ensemble RF
Automation through Soft10M
Text analytic tool - Treeminer
Agile learning
Agent based learning
Distributed learning
Introduction to open source tools for predictive analytics: R, RapidMiner, Mahout
Day 2: Session 4: Predictive analytics ecosystem-2: Common predictive analytic problems in Govt.
Insight analytic
Visualization analytic
Structured predictive analytic
Unstructured predictive analytic
Threat/fraudster/vendor profiling
Recommendation Engine
Pattern detection
Rule/Scenario discovery – failure, fraud, optimization
Root cause discovery
Sentiment analysis
CRM analytic
Network analytic
Text Analytics
Technology assisted review
Fraud analytic
Real Time Analytic
Day 3: Session 1: Real Time and Scalable Analytic Over Hadoop
Why common analytic algorithms fail in Hadoop/HDFS
Apache Hama for Bulk Synchronous distributed computing
Apache SPARK for cluster computing for real time analytic
CMU Graphics Lab2 - Graph based asynchronous approach to distributed computing
KNN p-Algebra based approach from Treeminer for reduced hardware cost of operation
Day 3: Session 2: Tools for eDiscovery and Forensics
eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance
Predictive coding and technology assisted review (TAR)
Live demo of a TAR product (vMiner) to understand how TAR works for faster discovery
Faster indexing through HDFS – velocity of data
NLP (Natural Language Processing) – various techniques and open source products
eDiscovery in foreign languages - technology for foreign language processing
Day 3: Session 3: Big Data BI for Cyber Security – Understanding the whole 360-degree view, from speedy data collection to threat identification
Understanding basics of security analytics - attack surface, security misconfiguration, host defenses
Network infrastructure/ Large datapipe / Response ETL for real time analytic
Prescriptive vs predictive – Fixed rule based vs auto-discovery of threat rules from metadata
Day 3: Session 4: Big Data in USDA: Application in Agriculture
Introduction to IoT (Internet of Things) for agriculture - sensor-based Big Data and control
Introduction to Satellite imaging and its application in agriculture
Integrating sensor and image data for fertility of soil, cultivation recommendation and forecasting
Agriculture insurance and Big Data
Crop Loss forecasting
Day 4: Session 1: Fraud prevention BI from Big Data in Govt. - Fraud analytics:
Basic classification of fraud analytics - rule-based vs predictive analytics
Supervised vs unsupervised Machine learning for Fraud pattern detection
Vendor fraud/over charging for projects
Medicare and Medicaid fraud - fraud detection techniques for claim processing
Travel reimbursement frauds
IRS refund frauds
Case studies and live demo will be given wherever data is available.
Day 4: Session 2: Social Media Analytics - Intelligence gathering and analysis
Big Data ETL API for extracting social media data
Text, image, meta data and video
Sentiment analysis from social media feed
Contextual and non-contextual filtering of social media feed
Social Media Dashboard to integrate diverse social media
Automated profiling of social media profiles
Live demo of each analytic will be given through Treeminer Tool.
Day 4: Session 3: Big Data analytics in image processing and video feeds
Image storage techniques in Big Data - storage solutions for data exceeding petabytes
LTFS and LTO
GPFS-LTFS (Layered storage solution for Big image data)
Fundamentals of image analytics
Object recognition
Image segmentation
Motion tracking
3D image reconstruction
Day 4: Session 4: Big Data applications in NIH:
Emerging areas of Bioinformatics
Metagenomics and Big Data mining issues
Big Data Predictive analytic for Pharmacogenomics, Metabolomics and Proteomics
Big Data in downstream Genomics process
Application of Big data predictive analytics in Public health
Big Data Dashboard for quick accessibility of diverse data and display:
Integration of existing application platform with Big Data Dashboard
Big Data management
Case Study of Big Data Dashboard: Tableau and Pentaho
Use Big Data app to push location based services in Govt.
Tracking system and management
Day 5: Session 1: How to justify Big Data BI implementation within an organization:
Defining ROI for Big Data implementation
Case studies for saving analyst time for collection and preparation of data – increase in productivity
Case studies of revenue gain from saving the licensed database cost
Revenue gain from location based services
Saving from fraud prevention
An integrated spreadsheet approach to calculate approx. expense vs. Revenue gain/savings from Big Data implementation.
Day 5: Session 2: Step-by-step procedure to replace a legacy data system with a Big Data system:
Understanding practical Big Data Migration Roadmap
What important information is needed before architecting a Big Data implementation
What are the different ways of calculating volume, velocity, variety and veracity of data
How to estimate data growth
Case studies
Day 5: Session 4: Review of Big Data vendors and their products. Q/A session:
Accenture
APTEAN (Formerly CDC Software)
Cisco Systems
Cloudera
Dell
EMC
GoodData Corporation
Guavus
Hitachi Data Systems
Hortonworks
HP
IBM
Informatica
Intel
Jaspersoft
Microsoft
MongoDB (Formerly 10Gen)
Mu Sigma
NetApp
Opera Solutions
Oracle
Pentaho
Platfora
QlikTech
Quantum
Rackspace
Revolution Analytics
Salesforce
SAP
SAS Institute
Sisense
Software AG/Terracotta
Soft10 Automation
Splunk
Sqrrl
Supermicro
Tableau Software
Teradata
Think Big Analytics
Tidemark Systems
Treeminer
VMware (Part of EMC)

dsbda 
Data Science for Big Data Analytics 
35 hours 
Introduction to Data Science for Big Data Analytics
Data Science Overview
Big Data Overview
Data Structures
Drivers and complexities of Big Data
Big Data ecosystem and a new approach to analytics
Key technologies in Big Data
Data Mining process and problems
Association Pattern Mining
Data Clustering
Outlier Detection
Data Classification
Introduction to Data Analytics lifecycle
Discovery
Data preparation
Model planning
Model building
Presentation/Communication of results
Operationalization
Exercise: Case study
From this point most of the training time (80%) will be spent on examples and exercises in R and related big data technology.
Getting started with R
Installing R and RStudio
Features of R language
Objects in R
Data in R
Data manipulation
Big data issues
Exercises
Getting started with Hadoop
Installing Hadoop
Understanding Hadoop modes
HDFS
MapReduce architecture
Hadoop related projects overview
Writing programs in Hadoop MapReduce
Exercises
Integrating R and Hadoop with RHadoop
Components of RHadoop
Installing RHadoop and connecting with Hadoop
The architecture of RHadoop
Hadoop streaming with R
Data analytics problem solving with RHadoop
Exercises
Preprocessing and preparing data
Data preparation steps
Feature extraction
Data cleaning
Data integration and transformation
Data reduction – sampling, feature subset selection, dimensionality reduction
Discretization and binning
Exercises and Case study
Exploratory data analytic methods in R
Descriptive statistics
Exploratory data analysis
Visualization – preliminary steps
Visualizing single variable
Examining multiple variables
Statistical methods for evaluation
Hypothesis testing
Exercises and Case study
Data Visualizations
Basic visualizations in R
Packages for data visualization: ggplot2, lattice, plotly
Formatting plots in R
Advanced graphs
Exercises
Regression (Estimating future values)
Linear regression
Use cases
Model description
Diagnostics
Problems with linear regression
Shrinkage methods, ridge regression, the lasso
Generalizations and nonlinearity
Regression splines
Local polynomial regression
Generalized additive models
Regression with RHadoop
Exercises and Case study
Classification
The classification related problems
Bayesian refresher
Naïve Bayes
Logistic regression
K-nearest neighbors
Decision tree algorithms
Neural networks
Support vector machines
Diagnostics of classifiers
Comparison of classification methods
Scalable classification algorithms
Exercises and Case study
Assessing model performance and selection
Bias, Variance and model complexity
Accuracy vs Interpretability
Evaluating classifiers
Measures of model/algorithm performance
Holdout method of validation
Cross-validation
Tuning machine learning algorithms with caret package
Visualizing model performance with Profit, ROC and Lift curves
Ensemble Methods
Bagging
Random Forests
Boosting
Gradient boosting
Exercises and Case study
Support vector machines for classification and regression
Maximal Margin classifiers
Support vector classifiers
Support vector machines
SVMs for classification problems
SVMs for regression problems
Exercises and Case study
Identifying unknown groupings within a data set
Feature Selection for Clustering
Representative-based algorithms: k-means, k-medoids
Hierarchical algorithms: agglomerative and divisive methods
Probabilistic-based algorithms: EM
Density-based algorithms: DBSCAN, DENCLUE
Cluster validation
Advanced clustering concepts
Clustering with RHadoop
Exercises and Case study
Discovering connections with Link Analysis
Link analysis concepts
Metrics for analyzing networks
The PageRank algorithm
Hyperlink-Induced Topic Search
Link Prediction
Exercises and Case study
Association Pattern Mining
Frequent Pattern Mining Model
Scalability issues in frequent pattern mining
Brute Force algorithms
Apriori algorithm
The FP-growth approach
Evaluation of Candidate Rules
Applications of Association Rules
Validation and Testing
Diagnostics
Association rules with R and Hadoop
Exercises and Case study
Constructing recommendation engines
Understanding recommender systems
Data mining techniques used in recommender systems
Recommender systems with recommenderlab package
Evaluating the recommender systems
Recommendations with RHadoop
Exercise: Building recommendation engine
Text analysis
Text analysis steps
Collecting raw text
Bag of words
Term Frequency – Inverse Document Frequency
Determining Sentiments
Exercises and Case study
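As an illustration of the bag-of-words and TF-IDF steps listed above, here is a minimal sketch. It assumes one common weighting variant, tf = count / document length and idf = ln(N / document frequency); the course may use a different variant or an R package, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    # Bag of words: whitespace tokenisation, lower-cased.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical two-document corpus.
for weights in tf_idf(["good service", "bad service"]):
    print(weights)
```

A term that occurs in every document receives weight zero under this variant, which is why common words contribute little when documents are compared.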

d2dbdpa 
From Data to Decision with Big Data and Predictive Analytics 
21 hours 
Audience
If you are trying to make sense of the data you have access to, or want to analyse unstructured data available on the net (such as Twitter, LinkedIn, etc.), this course is for you.
It is mostly aimed at decision makers and people who need to choose what data is worth collecting and what is worth analyzing.
It is not aimed at people configuring the solution, although they will benefit from the big picture.
Delivery Mode
During the course delegates will be presented with working examples of mostly open source technologies.
Short lectures will be followed by presentations and simple exercises by the participants.
Content and Software used
All software used is updated each time the course is run, so we work with the newest versions available.
The course covers the process from obtaining, formatting, processing and analysing the data to automating the decision-making process with machine learning.
Quick Overview
Data Sources
Mining Data
Recommender systems
Target Marketing
Datatypes
Structured vs unstructured
Static vs streamed
Attitudinal, behavioural and demographic data
Data-driven vs user-driven analytics
Data validity
Volume, velocity and variety of data
Models
Building models
Statistical Models
Machine learning
Data Classification
Clustering
k-Groups, k-means, nearest neighbours
Ant colonies, birds flocking
Predictive Models
Decision trees
Support vector machine
Naive Bayes classification
Neural networks
Markov Model
Regression
Ensemble methods
ROI
Benefit/Cost ratio
Cost of software
Cost of development
Potential benefits
Building Models
Data Preparation (MapReduce)
Data cleansing
Choosing methods
Developing model
Testing Model
Model evaluation
Model deployment and integration
Overview of Open Source and commercial software
Selection of R-project packages
Python libraries
Hadoop and Mahout
Selected Apache projects related to Big Data and Analytics
Selected commercial solutions
Integration with existing software and data sources

processmining 
Process Mining 
21 hours 
Process mining, or Automated Business Process Discovery (ABPD), is a technique that applies algorithms to event logs for the purpose of analyzing business processes. Process mining goes beyond data storage and data analysis; it bridges data with processes and provides insights into the trends and patterns that affect process efficiency.
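To make "applying algorithms to event logs" concrete, the sketch below counts the directly-follows relation between activities, a common building block of process discovery. The event tuple layout `(case_id, activity, timestamp)` and the sample log are assumptions made for this illustration; the code does not reproduce how any particular tool works.

```python
from collections import Counter, defaultdict

def directly_follows(event_log):
    """Count how often activity a is directly followed by activity b within a case."""
    traces = defaultdict(list)
    # Group events by case, ordered by timestamp within each case.
    for case_id, activity, _timestamp in sorted(event_log, key=lambda e: (e[0], e[2])):
        traces[case_id].append(activity)
    counts = Counter()
    for activities in traces.values():
        # Pair each activity with its immediate successor in the same trace.
        counts.update(zip(activities, activities[1:]))
    return counts

# Hypothetical event log: (case_id, activity, timestamp)
log = [
    ("c1", "register", 1), ("c1", "check", 2), ("c1", "approve", 3),
    ("c2", "register", 1), ("c2", "check", 2), ("c2", "reject", 3),
]
dfg = directly_follows(log)
```

The resulting counts can be read as a weighted graph: frequent edges suggest the main path of the process, rare edges suggest exceptions.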
Format of the course
The course starts with an overview of the most commonly used techniques for process mining. We discuss the various process discovery algorithms and tools used for discovering and modeling processes based on raw event data. Real-life case studies are examined and data sets are analyzed using the ProM open-source framework.
Audience
Data science professionals
Anyone interested in understanding and applying process modeling and data mining
Overview
Discovering, analyzing and rethinking your processes
Types of process mining
Discovery, conformance and enhancement
Process mining workflow
From log data analysis to response and action
Other tools for process mining
PMLAB, Apromore
Commercial offerings
Closing remarks

dataminr
Data Mining with R 
14 hours 
Sources of methods
Artificial intelligence
Machine learning
Statistics
Sources of data
Preprocessing of data
Data Import/Export
Data Exploration and Visualization
Dimensionality Reduction
Dealing with missing values
R Packages
Data mining main tasks
Automatic or semi-automatic analysis of large quantities of data
Extracting previously unknown interesting patterns
groups of data records (cluster analysis)
unusual records (anomaly detection)
dependencies (association rule mining)
Data mining
Anomaly detection (Outlier/change/deviation detection)
Association rule learning (Dependency modeling)
Clustering
Classification
Regression
Summarization
Frequent Pattern Mining
Text Mining
Decision Trees
Regression
Neural Networks
Sequence Mining
Frequent Pattern Mining
Data dredging, data fishing, data snooping

kdd
Knowledge Discovery in Databases (KDD)
21 hours 
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. Real-life applications for this data mining technique include marketing, fraud detection, telecommunication and manufacturing.
In this course, we introduce the processes involved in KDD and carry out a series of exercises to practice the implementation of those processes.
Audience
Data analysts or anyone interested in learning how to interpret data to solve problems
Format of the course
After a theoretical discussion of KDD, the instructor will present real-life cases which call for the application of KDD to solve a problem. Participants will prepare, select and cleanse sample data sets and use their prior knowledge about the data to propose solutions based on the results of their observations.
Introduction
KDD vs data mining
Establishing the application domain
Establishing relevant prior knowledge
Understanding the goal of the investigation
Creating a target data set
Data cleaning and preprocessing
Data reduction and projection
Choosing the data mining task
Choosing the data mining algorithms
Interpreting the mined patterns

psr
Introduction to Recommendation Systems 
7 hours 
Audience
Marketing department employees, IT strategists and other people involved in decisions related to the design and implementation of recommender systems.
Format
A short theoretical background followed by analysis of working examples and short, simple exercises.
Challenges related to data collection
Information overload
Data types (video, text, structured data, etc.)
Potential of the data now and in the near future
Basics of Data Mining
Recommendation and searching
Searching and Filtering
Sorting
Determining weights of the search results
Using Synonyms
Full-text search
Long Tail
Chris Anderson's idea
Drawbacks of Long Tail
Determining Similarities
Products
Users
Documents and web sites
Content-based recommendation and measurement of similarities
Cosine distance
Euclidean distance between vectors
TF-IDF and frequency of terms
Collaborative filtering
Community rating
Graphs
Applications of graphs
Determining similarity of graphs
Similarity between users
Neural Networks
Basic concepts of Neural Networks
Training Data and Validation Data
Neural Network examples in recommender systems
How to encourage users to share their data
Making systems more comfortable
Navigation
Functionality and UX
Case Studies
Popularity of recommender systems and their problems
Examples
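The similarity measures named in this outline (cosine and Euclidean distance between vectors, similarity between users) can be sketched in a few lines of Python. The ratings and user names are invented for illustration, and treating an unrated item as 0 is a simplifying assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical ratings of four products by three users (0 = not rated).
ratings = {"alice": [5, 3, 0, 1], "bob": [4, 2, 0, 1], "carol": [0, 0, 5, 4]}

def most_similar(user):
    """Name of the other user whose rating vector has the highest cosine similarity."""
    others = (name for name in ratings if name != user)
    return max(others, key=lambda name: cosine_similarity(ratings[user], ratings[name]))

nearest = most_similar("alice")
```

In a user-based collaborative filter, items liked by the nearest users but not yet seen by the target user become candidate recommendations.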

druid 
Druid: Build a fast, real-time data analysis system
21 hours 
Druid is an open-source, column-oriented, distributed data store written in Java. It was designed to quickly ingest massive quantities of event data and execute low-latency OLAP queries on that data. Druid is commonly used in business intelligence applications to analyze high volumes of real-time and historical data. It is also well suited for powering fast, interactive, analytic dashboards for end users. Druid is used by companies such as Alibaba, Airbnb, Cisco, eBay, Netflix, PayPal, and Yahoo.
In this course we explore some of the limitations of data warehouse solutions and discuss how Druid can complement those technologies to form a flexible and scalable streaming analytics stack. We walk through many examples, offering participants the chance to implement and test Druid-based solutions in a lab environment.
Audience
Application developers
Software engineers
Technical consultants
DevOps professionals
Architecture engineers
Format of the course
Part lecture, part discussion, heavy hands-on practice, occasional tests to gauge understanding
Introduction
Installing and starting Druid
Druid architecture and design
Real-time ingestion of event data
Sharding and indexing
Loading data
Querying data
Visualizing data
Running a distributed cluster
Druid + Apache Hive
Druid + Apache Kafka
Druid + others
Troubleshooting
Administrative tasks

pmml
Predictive Models with PMML 
7 hours 
The course is designed for scientists, developers, analysts and anyone else who wants to standardize or exchange their models using the Predictive Model Markup Language (PMML) file format.
Predictive Models
Intro to predictive models
Predictive models supported by PMML
PMML Elements
Header
Data Dictionary
Data Transformations
Model
Mining Schema
Targets
Output
API
Overview of API providers for PMML
Executing your model in a cloud
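To show how the PMML elements listed above fit together, the sketch below builds a very small PMML document for a linear regression model using only the Python standard library. The helper name `build_linear_pmml`, the field names and the coefficients are hypothetical, and the output has not been validated against the official PMML schema, so treat it as an outline of the structure rather than a production exporter.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"  # namespace of PMML 4.4

def build_linear_pmml(intercept, coefficients, target="y"):
    """Serialise y = intercept + sum(coef * x) as a small PMML document."""
    root = ET.Element("PMML", {"version": "4.4", "xmlns": PMML_NS})
    ET.SubElement(root, "Header", {"description": "Illustrative linear model"})

    # Data Dictionary: declares every field the model refers to.
    data_dict = ET.SubElement(root, "DataDictionary")
    for name in list(coefficients) + [target]:
        ET.SubElement(data_dict, "DataField",
                      {"name": name, "optype": "continuous", "dataType": "double"})

    model = ET.SubElement(root, "RegressionModel", {"functionName": "regression"})

    # Mining Schema: states how each field is used by this model.
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", {"name": name, "usageType": "active"})
    ET.SubElement(schema, "MiningField", {"name": target, "usageType": "target"})

    # Regression table: the intercept plus one coefficient per input field.
    table = ET.SubElement(model, "RegressionTable", {"intercept": str(intercept)})
    for name, coef in coefficients.items():
        ET.SubElement(table, "NumericPredictor",
                      {"name": name, "coefficient": str(coef)})

    return ET.tostring(root, encoding="unicode")

# Hypothetical model: y = 1.5 + 0.2 * age + 0.01 * income
pmml_text = build_linear_pmml(1.5, {"age": 0.2, "income": 0.01})
```

Because the model is plain XML, it can be written by one tool and scored by another, which is the exchange scenario this course addresses.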

BigData_ 
A practical introduction to Data Analysis and Big Data 
28 hours 
Participants who complete this training will gain a practical, real-world understanding of Big Data and its related technologies, methodologies and tools.
Participants will have the opportunity to put this knowledge into practice through hands-on exercises. Group interaction and instructor feedback make up an important component of the class.
The course starts with an introduction to elemental concepts of Big Data, then progresses into the programming languages and methodologies used to perform Data Analysis. Finally, we discuss the tools and infrastructure that enable Big Data storage, Distributed Processing, and Scalability.
Audience
Developers / programmers
IT consultants
Format of the course
Part lecture, part discussion, heavy hands-on practice and implementation, occasional quizzing to measure progress.
Introduction to Data Analysis and Big Data
What makes Big Data "big"?
Velocity, Volume, Variety, Veracity (VVVV)
Limits to traditional Data Processing
Distributed Processing
Statistical Analysis
Types of Machine Learning Analysis
Data Visualization
Distributed Processing
MapReduce
Languages used for Data Analysis
R language (crash course)
Python (crash course)
Approaches to Data Analysis
Statistical Analysis
Time Series analysis
Forecasting with Correlation and Regression models
Inferential Statistics (estimating)
Descriptive Statistics in Big Data sets (e.g. calculating mean)
Machine Learning
Supervised vs unsupervised learning
Classification and clustering
Estimating cost of specific methods
Filter
Natural Language Processing
Processing text
Understanding the meaning of the text
Automatic text generation
Sentiment/Topic Analysis
Computer Vision
Big Data infrastructure
Data Storage
Relational databases (SQL)
MySQL
Postgres
Oracle
Nonrelational databases (NoSQL)
Cassandra
MongoDB
Neo4j
Understanding the nuances: hierarchical, object-oriented, document-oriented, graph-oriented, etc.
Distributed File Systems
HDFS
Search Engines
ElasticSearch
Distributed Processing
Spark
Machine Learning libraries: MLlib
Spark SQL
Scalability
Public cloud
AWS, Google, Aliyun, etc.
Private cloud
OpenStack, Cloud Foundry, etc.
Autoscalability
Choosing the right solution for the problem

68780 
Apache Spark 
14 hours 
Why Spark?
Problems with Traditional Large-Scale Systems
Introducing Spark
Spark Basics
What is Apache Spark?
Using the Spark Shell
Resilient Distributed Datasets (RDDs)
Functional Programming with Spark
Working with RDDs
RDD Operations
Key-Value Pair RDDs
MapReduce and Pair RDD Operations
The Hadoop Distributed File System
Why HDFS?
HDFS Architecture
Using HDFS
Running Spark on a Cluster
Overview
A Spark Standalone Cluster
The Spark Standalone Web UI
Parallel Programming with Spark
RDD Partitions and HDFS Data Locality
Working With Partitions
Executing Parallel Operations
Caching and Persistence
RDD Lineage
Caching Overview
Distributed Persistence
Writing Spark Applications
Spark Applications vs. Spark Shell
Creating the SparkContext
Configuring Spark Properties
Building and Running a Spark Application
Logging
Spark, Hadoop, and the Enterprise Data Center
Overview
Spark and the Hadoop Ecosystem
Spark and MapReduce
Spark Streaming
Spark Streaming Overview
Example: Streaming Word Count
Other Streaming Operations
Sliding Window Operations
Developing Spark Streaming Applications
Common Spark Algorithms
Iterative Algorithms
Graph Analysis
Machine Learning
Improving Spark Performance
Shared Variables: Broadcast Variables
Shared Variables: Accumulators
Common Performance Issues
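The "MapReduce and Pair RDD Operations" topic in this outline can be previewed without a cluster. The sketch below is a plain-Python emulation of the word-count pattern: it is not PySpark code, and the function name and sample lines are made up for illustration. In Spark the same three steps would typically be expressed with the RDD operations `flatMap`, `map` and `reduceByKey`.

```python
from functools import reduce
from itertools import groupby

def word_count(lines):
    """Emulate map -> shuffle -> reduce-by-key on an in-memory list of lines."""
    # Map phase: split every line into words and emit a (word, 1) pair per word.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Shuffle phase: bring pairs with the same key next to each other.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: sum the counts of each key.
    return {
        word: reduce(lambda a, b: a + b, (count for _, count in group))
        for word, group in groupby(pairs, key=lambda kv: kv[0])
    }

counts = word_count(["to be or", "not to be"])
```

On a real cluster the map and reduce phases run in parallel on partitions of the data, and the shuffle moves pairs between machines, which is why it often dominates the cost.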

neo4j 
Beyond the relational database: neo4j 
21 hours 
Relational, table-based databases such as Oracle and MySQL have long been the standard for organizing and storing data. However, the growing size and fluidity of data have made it difficult for these traditional systems to efficiently execute highly complex queries on the data. Imagine replacing rows-and-columns-based data storage with object-based data storage, whereby entities (e.g., a person) could be stored as data nodes, then easily queried on the basis of their vast, multi-linear relationships with other nodes. And imagine querying these connections and their associated objects and properties using a compact syntax, up to 20 times lighter than SQL. This is what graph databases, such as neo4j, offer.
In this hands-on course, we will set up a live project and put into practice the skills to model, manage and access your data. We compare and contrast graph databases with SQL-based databases as well as other NoSQL databases and clarify when and where it makes sense to implement each within your infrastructure.
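The idea of storing entities as nodes and querying them through their relationships can be sketched in a few lines of Python. The `TinyGraph` class, the `KNOWS` relationship type and the sample people are all hypothetical; this is an in-memory toy that illustrates the data model, not how neo4j stores or queries data.

```python
from collections import defaultdict

class TinyGraph:
    """A toy property graph: nodes carry properties, edges carry a type."""

    def __init__(self):
        self.nodes = {}                 # node id -> dict of properties
        self.edges = defaultdict(list)  # node id -> [(relationship type, target id)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def relate(self, source, rel_type, target):
        self.edges[source].append((rel_type, target))

    def neighbours(self, node_id, rel_type):
        """Targets reachable from node_id over one edge of the given type."""
        return [t for r, t in self.edges.get(node_id, []) if r == rel_type]

def friends_of_friends(graph, person):
    """People two KNOWS hops away who are not already direct friends."""
    direct = set(graph.neighbours(person, "KNOWS"))
    second = {fof for friend in direct for fof in graph.neighbours(friend, "KNOWS")}
    return second - direct - {person}

g = TinyGraph()
for name in ("alice", "bob", "carol"):
    g.add_node(name, kind="Person")
g.relate("alice", "KNOWS", "bob")
g.relate("bob", "KNOWS", "carol")
g.relate("bob", "KNOWS", "alice")
suggestions = friends_of_friends(g, "alice")
```

The query follows relationships directly from node to node; in a relational schema the same question would usually need a self-join on a link table for every hop.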
Audience
Database administrators (DBAs)
Data analysts
Developers
System Administrators
DevOps engineers
Business Analysts
CTOs
CIOs
Format of the course
Heavy emphasis on hands-on practice. Most of the concepts are learned through samples, exercises and hands-on development.
Getting started with neo4j
neo4j vs relational databases
neo4j vs other NoSQL databases
Using neo4j to solve real world problems
Installing neo4j
Data modeling with neo4j
Mapping whiteboard diagrams and mind maps to neo4j
Working with nodes
Creating, changing and deleting nodes
Defining node properties
Node relationships
Creating and deleting relationships
Bidirectional relationships
Querying your data with Cypher
Querying your data based on relationships
MATCH, RETURN, WHERE, REMOVE, MERGE, etc.
Setting indexes and constraints
Working with the REST API
REST operations on nodes
REST operations on relationships
REST operations on indexes and constraints
Accessing the core API for application development
Working with .NET, Java, JavaScript and Python APIs
Closing remarks

datamin 
Data Mining 
21 hours 
The course can be provided with any tools, including free open-source data mining software and applications.
Introduction
Data mining as the analysis step of the KDD process ("Knowledge Discovery in Databases")
Subfield of computer science
Discovering patterns in large data sets
Sources of methods
Artificial intelligence
Machine learning
Statistics
Database systems
What is involved?
Database and data management aspects
Data preprocessing
Model and inference considerations
Interestingness metrics
Complexity considerations
Postprocessing of discovered structures
Visualization
Online updating
Data mining main tasks
Automatic or semi-automatic analysis of large quantities of data
Extracting previously unknown interesting patterns
groups of data records (cluster analysis)
unusual records (anomaly detection)
dependencies (association rule mining)
Data mining
Anomaly detection (Outlier/change/deviation detection)
Association rule learning (Dependency modeling)
Clustering
Classification
Regression
Summarization
Use and applications
Able Danger
Behavioral analytics
Business analytics
Cross Industry Standard Process for Data Mining
Customer analytics
Data mining in agriculture
Data mining in meteorology
Educational data mining
Human genetic clustering
Inference attack
Java Data Mining
Open-source intelligence
Path analysis (computing)
Reactive business intelligence
Data dredging, data fishing, data snooping

datashrinkgov
Data Shrinkage for Government 
14 hours 
Why shrink data
Relational databases
Introduction
Aggregation and disaggregation
Normalisation and denormalisation
Null values and zeroes
Joining data
Complex joins
Cluster analysis
Applications
Strengths and weaknesses
Measuring distance
Hierarchical clustering
K-means and derivatives
Applications in Government
Factor analysis
Concepts
Exploratory factor analysis
Confirmatory factor analysis
Principal component analysis
Correspondence analysis
Software
Applications in Government
Predictive analytics
Timelines and naming conventions
Holdout samples
Weights of evidence
Information value
Scorecard building demonstration using a spreadsheet
Regression in predictive analytics
Logistic regression in predictive analytics
Decision Trees in predictive analytics
Neural networks
Measuring accuracy
Applications in Government
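The "Weights of evidence" and "Information value" topics in the predictive analytics part can be illustrated with a short calculation. The sketch below assumes the common convention WoE = ln(share of goods / share of bads) per bin and IV = sum over bins of (share of goods - share of bads) * WoE; some texts flip the ratio. The bin labels and counts are invented, and every bin is assumed to contain at least one good and one bad case.

```python
import math

def woe_iv(bins):
    """bins: {label: (n_good, n_bad)}. Returns ({label: woe}, information_value)."""
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe = {}
    iv = 0.0
    for label, (good, bad) in bins.items():
        pct_good = good / total_good  # share of all goods that fall in this bin
        pct_bad = bad / total_bad     # share of all bads that fall in this bin
        woe[label] = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe[label]
    return woe, iv

# Hypothetical age bands with (good, bad) outcome counts.
age_bins = {"young": (100, 50), "middle": (300, 30), "senior": (100, 20)}
woe, iv = woe_iv(age_bins)
```

A bin with the same share of goods and bads has a WoE of zero and contributes nothing to the information value; the further the WoE is from zero, the more the bin separates the two outcomes.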

matlab2 
MATLAB Fundamentals 
21 hours 
This three-day course provides a comprehensive introduction to the MATLAB technical computing environment. The course is intended for beginning users and those looking for a review. No prior programming experience or knowledge of MATLAB is assumed. Themes of data analysis, visualization, modeling, and programming are explored throughout the course. Topics include:
Working with the MATLAB user interface
Entering commands and creating variables
Analyzing vectors and matrices
Visualizing vector and matrix data
Working with data files
Working with data types
Automating commands with scripts
Writing programs with logic and flow control
Writing functions
Part 1
A Brief Introduction to MATLAB
Objectives: Offer an overview of what MATLAB is, what it consists of, and what it can do for you
An Example: C vs. MATLAB
MATLAB Product Overview
MATLAB Application Fields
What can MATLAB do for you?
The Course Outline
Working with the MATLAB User Interface
Objective: Get an introduction to the main features of the MATLAB integrated design environment and its user interfaces. Get an overview of course themes.
MATLAB Interface
Reading data from file
Saving and loading variables
Plotting data
Customizing plots
Calculating statistics and best-fit line
Exporting graphics for use in other applications
Variables and Expressions
Objective: Enter MATLAB commands, with an emphasis on creating and accessing data in variables.
Entering commands
Creating variables
Getting help
Accessing and modifying values in variables
Creating character variables
Analysis and Visualization with Vectors
Objective: Perform mathematical and statistical calculations with vectors, and create basic visualizations. See how MATLAB syntax enables calculations on whole data sets with a single command.
Calculations with vectors
Plotting vectors
Basic plot options
Annotating plots
Analysis and Visualization with Matrices
Objective: Use matrices as mathematical objects or as collections of (vector) data. Understand the appropriate use of MATLAB syntax to distinguish between these applications.
Size and dimensionality
Calculations with matrices
Statistics with matrix data
Plotting multiple columns
Reshaping and linear indexing
Multidimensional arrays
Part 2
Automating Commands with Scripts
Objective: Collect MATLAB commands into scripts for ease of reproduction and experimentation. As the complexity of your tasks increases, entering long sequences of commands in the Command Window becomes impractical.
A Modelling Example
The Command History
Creating script files
Running scripts
Comments and Code Cells
Publishing scripts
Working with Data Files
Objective: Bring data into MATLAB from formatted files. Because imported data can be of a wide variety of types and formats, emphasis is given to working with cell arrays and date formats.
Importing data
Mixed data types
Cell arrays
Conversions amongst numerals, strings, and cells
Exporting data
Multiple Vector Plots
Objective: Make more complex vector plots, such as multiple plots, and use color and string manipulation techniques to produce eye-catching visual representations of data.
Graphics structure
Multiple figures, axes, and plots
Plotting equations
Using color
Customizing plots
Logic and Flow Control
Objective: Use logical operations, variables, and indexing techniques to create flexible code that can make decisions and adapt to different situations. Explore other programming constructs for repeating sections of code, and constructs that allow interaction with the user.
Logical operations and variables
Logical indexing
Programming constructs
Flow control
Loops
Matrix and Image Visualization
Objective: Visualize images and matrix data in two or three dimensions. Explore the difference in displaying images and visualizing matrix data using images.
Scattered Interpolation using vector and matrix data
3D matrix visualization
2D matrix visualization
Indexed images and colormaps
True color images
Part 3
Data Analysis
Objective: Perform typical data analysis tasks in MATLAB, including developing and fitting theoretical models to real-life data. This leads naturally to one of the most powerful features of MATLAB: solving linear systems of equations with a single command.
Dealing with missing data
Correlation
Smoothing
Spectral analysis and FFTs
Solving linear systems of equations
Writing Functions
Objective: Increase automation by encapsulating modular tasks as user-defined functions. Understand how MATLAB resolves references to files and variables.
Why functions?
Creating functions
Adding comments
Calling subfunctions
Workspaces
Subfunctions
Path and precedence
Data Types
Objective: Explore data types, focusing on the syntax for creating variables and accessing array elements, and discuss methods for converting among data types. Data types differ in the kind of data they may contain and the way the data is organized.
MATLAB data types
Integers
Structures
Converting types
File I/O
Objective: Explore the low-level data import and export functions in MATLAB that allow precise control over text and binary file I/O. These functions include textscan, which provides precise control of reading text files.
Opening and closing files
Reading and writing text files
Reading and writing binary files
Note that the course as actually delivered might be subject to minor discrepancies from the outline above without prior notification.
Conclusion
Objectives: Summarise what we have learnt
A summary of the course
Other upcoming courses on MATLAB

osqlide
Oracle SQL Intermediate - Data Extraction
14 hours 
Limiting results
The WHERE clause
Comparison operators
LIKE Condition
BETWEEN ... AND condition
IS NULL condition
IN condition
Boolean operators AND, OR and NOT
Multiple conditions in the WHERE clause
Operator precedence
DISTINCT clause
SQL functions
Differences between single-row and multi-row functions
Text, numeric and date functions
Explicit and implicit conversion
Conversion functions
Nesting functions
Viewing function results - the DUAL table
Getting the current date with the SYSDATE function
Handling of NULL values
Aggregating data using grouping functions
Grouping functions
How grouping functions treat NULL values
Creating groups of data - the GROUP BY clause
Grouping by multiple columns
Limiting grouping function results - the HAVING clause
Subqueries
Placing subqueries in the SELECT statement
Single-row and multi-row subqueries
Single-row subquery operators
Grouping functions in subqueries
Multi-row subquery operators - IN, ALL, ANY
How NULL values are treated in subqueries
Set operators
UNION operator
UNION ALL operator
INTERSECT operator
MINUS operator
Further Usage Of Joins
Revisit Joins
Combining Inner and Outer Joins
Partitioned Outer Joins
Hierarchical Queries
Further Usage Of SubQueries
Revisit subqueries
Use of subqueries as virtual tables/inline views and columns
Use of the WITH construction
Combining subqueries and joins
Analytics functions
OVER clause
Partition Clause
Windowing Clause
Rank, Lead, Lag, First, Last functions
Retrieving data from multiple tables (if time at end)
Types of joins
Using NATURAL JOIN
Table aliases
Joins in the WHERE clause
Inner joins - INNER JOIN
Outer joins - LEFT, RIGHT, FULL OUTER JOIN
Cartesian product
Aggregate Functions (if time at end)
Revisit the GROUP BY and HAVING clauses
Group and Rollup
Group and Cube
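The grouping topics in this outline (grouping functions, their treatment of NULL values, GROUP BY and HAVING) can be tried out without an Oracle installation. The sketch below uses Python's built-in `sqlite3` module as a stand-in; the `employees` table and its rows are invented, and Oracle-specific features such as the DUAL table, ROLLUP or CUBE are not covered because SQLite does not behave identically there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ann", "IT", 5000), ("Bob", "IT", None),   # Bob's salary is NULL
        ("Cid", "HR", 3000), ("Dee", "HR", 3500),
        ("Eve", "OPS", 2000),
    ],
)

rows = conn.execute(
    """
    SELECT dept,
           COUNT(*)      AS headcount,    -- counts every row in the group
           COUNT(salary) AS with_salary,  -- skips rows where salary IS NULL
           AVG(salary)   AS avg_salary    -- NULLs are ignored by AVG as well
    FROM employees
    GROUP BY dept
    HAVING COUNT(*) > 1                   -- keep only groups with several rows
    ORDER BY dept
    """
).fetchall()
conn.close()
```

The IT group shows the NULL handling: two employees are counted, but only one salary enters the average. The single-row OPS group is removed by HAVING, which filters groups after aggregation, whereas WHERE filters rows before it.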
