Applied Predictive Modeling Blog

This is a continuation of the original blog that we made for our previous book. New posts, as of 2024, can be found here. The posts here are not associated with our employers; opinions are our own.

Other places that we create content: the Tidyverse blog and tidymodels.org’s Learn pages.

Part 3 is Finished, Part 4 Started

We’ve released additional chapters in the last month or so. These conclude Part 3 of the book.

Progress Update (September 2024)

Three new chapters!

Progress Update (June 2024)

We just released a new set of chapters:

Data Usage with Postprocessing

This document is used to discuss and test ideas for how we can estimate and evaluate machine learning models that have three potential components:

Max Kuhn, Simon Couch

Post Hoc Nearest Neighbors Prediction Adjustments

Quinlan (1993) describes a post-processing technique used for numeric predictions that adjusts them using information from the training set.
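As a rough illustration of the idea, here is a minimal Python sketch of one reading of this kind of adjustment: each nearby training point contributes its observed outcome, shifted by the difference between the model's prediction for the new sample and its prediction for that neighbor, and the contributions are averaged. The function name `knn_adjust`, the Euclidean distance, and the equal weighting of neighbors are assumptions made for this sketch, not details taken from the post.

```python
import math


def knn_adjust(new_x, new_pred, train_x, train_y, train_pred, k=3):
    """Adjust a model prediction using the k nearest training points.

    Hypothetical helper: neighbors are weighted equally and distances
    are Euclidean, both of which are assumptions of this sketch.
    """
    # Distance from the new sample to every training sample
    dists = [math.dist(new_x, x) for x in train_x]
    # Indices of the k closest training samples
    nearest = sorted(range(len(train_x)), key=dists.__getitem__)[:k]
    # Observed outcome of each neighbor, shifted by how much the model's
    # prediction for the new sample differs from its prediction there
    adjusted = [train_y[i] + (new_pred - train_pred[i]) for i in nearest]
    return sum(adjusted) / len(adjusted)
```

For example, `knn_adjust((0.4,), 1.0, [(0,), (1,), (10,)], [1.0, 2.0, 20.0], [0.5, 1.5, 19.0], k=2)` averages the adjusted values from the two closest training points and ignores the distant third one.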

February 2024 Talks

One keynote and one tutorial last month (both new).

Predictive Survival Analysis

Predictive survival models come to tidymodels.

Two New Preprocessing Chapters

We just released two new chapters: “Transforming Numeric Predictors” and “Working with Categorical Predictors.”

2024 Tidymodels User Survey

Tell us which features are most important to you.

WTF Article

Kjell and I have a new paper called “What They Forgot to Tell You about Machine Learning with an Application to Pharmaceutical Manufacturing.”

Progress Update (February 2024)

Since the last update on 2023-11-20, we have a few new sections and chapters.

New Location, Same Content

This is the new home for the Applied Predictive Modeling blog.

2022 tidymodels user survey

We are conducting another survey to see where users would like us to spend our development time.

tidymodels updates and voting!

While I’m still supporting caret, the majority of my development effort has gone into the tidyverse modeling packages (called tidymodels).

Slides from R/Pharma

My slides from the R/Pharma conference on “Modeling in the Tidyverse” are in pdf format as well as the HTML version.

R/Medicine conference

I’ll be giving a talk at the R/Medicine conference on Sept 7th in New Haven CT.

Podcast on Nonclinical Statistics

Hugo Bowne-Anderson and I spoke about data science in pharmaceuticals, the tidyverse, and more for the excellent DataFramed podcast from DataCamp. Listen to it here or…

Early draft of our “Feature Engineering and Selection” book

Kjell and I are writing another book on predictive modeling, this time focused on all the things that you can do with predictors. It’s about 60% done and we’d love to get…

tidyposterior slides

tidyposterior is an R package for comparing models based on their resampling statistics. There are a few case studies on the webpage to illustrate the process.

New Workshop in Washington DC (August)

I’ll be conducting a workshop called “Applied Machine Learning” in Washington DC on August 15 and 16. The last one, at the RStudio conference, sold out quickly.

Tidy Resampling Redux with Agricultural Economics Data

(No statistical graphs in this one. This is what my dog Artemis looks like when she wants my attention during work hours.)

RStudio 2018 Conference Presentation and Materials

We’ve released our videos of the talks at the 2018 RStudio conference. My talk was Modeling in the Tidyverse (video) and I was also in the Tidyverse fireside chat (video).…

While you wait for that to finish, can I interest you in parallel processing?

caret has been able to utilize parallel processing for some time (before it was on CRAN in October 2007) using slightly different versions of the package. Around September…

Lots of Package News

I’ve sent a slew of packages to CRAN recently (thanks to Swetlana and Uwe). There are updates to:

caret Cheatsheet

It can be found on the RStudio cheatsheet page. Suggestions and pull requests are always welcome.

Nested Resampling with rsample

A typical scheme for splitting the data when developing a predictive model is to create an initial split of the data into a training and test set. If resampling is used, it…

Nonclinical Statistics Position in New England

I try to limit postings about jobs here, but there is an interesting position in pharma for a statistician in New England.

Do Resampling Estimates Have Low Correlation to the Truth? The Answer May Shock You.

One criticism that is often leveled against using resampling methods (such as cross-validation) to measure model performance is that there is no correlation between the CV…

caret package plans

A few people have asked if anything is going to happen to caret now that I’m working at RStudio.

Working at RStudio

I’ve joined Hadley’s team at RStudio.

2016 UK Tour

I’ll be in the UK next week doing three talks in three days:

DataCamp Course [UPDATE]

Zachary Deane-Mayer, who collaborates on caret, has put together a DataCamp course on Machine Learning in R.

Boston R User Group Talk [UPDATE]

I’ll be giving a talk at the Boston R User Group on Thursday March 10th at 6:00 PM. The talk will be on rule-based regression models.

Nonclinical Statistics Book

Springer has a new book (Amazon) edited by Lanju Zhang that captures the breadth of problems for statistics in the pharmaceutical industry including: compound optimization…

Central Iowa R User Group Talk [Updated]

I’ll be giving a talk (“Applied Predictive Modeling”) to the Central Iowa R User Group on Thursday night from 6:00 PM to 8:00 PM (CST).

Nonclinical Statistician Position at Pfizer

The Research Statistics group collaborates across a wide variety of activities in the early phases of drug discovery. This position is located in Groton CT and has a focus…

In Search Of…

Rafael Ladeira asked on github:

C5.0 Class Probability Shrinkage

(The image above has nothing to do with this post. It does, however, show the prize that my son won during a recent vacation to Virginia and how I got it back home).

The 2014 Ziegel Award

Last night in Seattle, Kjell and I were awarded the American Statistical Association’s Ziegel Award for the best book reviewed in Technometrics during 2014. Technometrics is…

Feature Engineering versus Feature Extraction: Game On!

“Feature engineering” is a fancy term for making sure that your predictors are encoded in the model in a manner that makes it as easy as possible for the model to achieve…

New caret Version (6.0-52)

A new version of caret (6.0-52) is on CRAN.

Slides from recent talks

I’ve been buried in work lately but thought I’d share the slides from two recent talks. The first is from the Bay Area RUG. Since someone filmed the talks, I was waiting to…

A Talk and Course in NYC Next Week

I’ll be giving a talk on Tuesday February 17 (7:00PM-9:00PM) that will be an overview of predictive modeling. It will not be highly technical and here is the current outline:

Simulated Annealing Feature Selection

As previously mentioned, caret has two new feature selection routines based on genetic algorithms (GA) and simulated annealing (SA). The help pages for the two new functions…

Regression Solutions Available

The github page for the APM exercises has been updated with three new files for Chapters 6-8 (the section on regression).

New Version of caret on CRAN

A new version of caret is on CRAN.

My Research Tools

Pfizer has an excellent group of librarians and they recently contacted people, including a few statisticians, about how we find and organize articles. I’ve spent…

Comparing the Bootstrap and Cross-Validation

This is the second of two posts about the performance characteristics of resampling methods. The first post focused on the cross-validation techniques and this post mostly…

Comparing Different Species of Cross-Validation

This is the first of two posts about the performance characteristics of resampling methods. I just had major shoulder surgery, but I’ve pre-seeded a few blog posts. More…

Solutions on github

See this page. We’re not done with them all but Chapters 3 and 4 are there and the regression chapters are not too far behind.

Some Thoughts on “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?”

Sorry for the blogging break. I’ve got a few planned for the next few weeks based on some work I’ve been doing.

Exercise Solutions

I’m finally recovering from the summer and will start posting again soon.

useR! 2014 Highlights

(This article was originally posted at http://appliedpredictivemodeling.com)

New caret version with adaptive resampling

A new version of caret is on CRAN now.

A Tutorial and Talk at useR! 2014 [Important Update]

See the update below

Cross-validation pitfalls when selecting and assessing regression and classification models

Damjan Krstajic and friends have a great paper on pitfalls of cross-validation. Although the paper uses chemistry data, the meat of the article is broadly applicable. It…

Bay Area RUG Talk on 3/17 (updated)

I’m making my yearly pilgrimage to San Francisco to teach at PAW.

caret webinar materials

The webinar was recorded (thanks to Ray DiGiacomo and the Orange County RUG). The slides are here minus a few typos.

Sample Mislabeling and Boosted Trees

A while back, I saw this post on StackExchange/Crossvalidated: “Does anyone know how well C5.0 boosting performs in the presence of mislabeled data?” I did some simulations…

Optimizing Probability Thresholds for Class Imbalances

One of the toughest problems in predictive modeling occurs when the classes have a severe imbalance. We spend an entire chapter on this subject itself. One consequence of this…

caret webinar on Feb 25

I’ll be doing a webinar with the Orange County R User Group on the caret package on Tue, Feb 25, 2014 1:00 PM - 2:00 PM EST.

Calibration Affirmation

In the book, we discuss the notion of a probability model being “well calibrated”. There are many different mathematical techniques that classification models use to produce…

Down-Sampling Using Random Forests

We discuss dealing with large class imbalances in Chapter 16. One approach is to sample the training set to coerce a more balanced class distribution. We discuss

ASA Talk in New York This Thursday (11/14)

I’ll be giving a talk on predictive modeling for the American Statistical Association next Thursday (the 14th):

The Basics of Encoding Categorical Data for Predictive Models

Thomas Yokota asked a very straight-forward question about encodings for categorical predictors: “Is it bad to feed it non-numerical data such as factors?” As usual, I will…

Exercises and Solutions

Kjell and I are putting together solutions to the exercise sections. It is a lot of work, so it may take some time.

Availability

After being on backorder for about 10 weeks, we are told that a larger batch of books has been printed and should be available shortly (despite the Amazon page’s note about…

Equivocal Zones

In Chapter 11, equivocal zones were briefly discussed. The idea is that some classification errors are close to the probability boundary (i.e. 50% for two class outcomes).…
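A minimal Python sketch of the idea for a two-class problem: predicted probabilities that fall within some distance of the 50% boundary are reported as "equivocal" instead of being forced into a class. The function name, the class labels, and the default zone half-width of 0.10 are illustrative assumptions, not values from the book.

```python
def classify_with_equivocal_zone(prob, threshold=0.5, half_width=0.10):
    """Label a two-class probability, abstaining near the boundary.

    Hypothetical helper: `half_width` sets how far the equivocal zone
    extends on each side of `threshold` and is an arbitrary choice here.
    """
    # Too close to the boundary to make a confident call
    if abs(prob - threshold) < half_width:
        return "equivocal"
    # Otherwise assign the class on the corresponding side of the threshold
    return "event" if prob >= threshold else "nonevent"
```

Performance would then be computed only on the samples that receive a class label, alongside the rate of equivocal results.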

UseR! Slides for “Classification Using C5.0”

I’ve had a lot of requests, so here they are. Hopefully, all of the slides will be posted on the conference website.

UseR! 2013 Highlights

The conference was excellent this year. My highlights:

Measuring Associations

In Chapter 18, we discuss a relatively new method for measuring predictor importance called the maximal information coefficient (MIC). The original paper is by Reshef et al…

type = “what”?

One great thing about R is that it has a wide diversity of packages written by many different people of many different viewpoints on how software should be designed. However…

Feature Selection 3 - Swarm Mentality

“Bees don’t swarm in a mango grove for nothing. Where can you see a wisp of smoke without a fire?” - Hāla Sātavāhana

33 Months Later

After starting in September 2010, the book is now available from the publisher. The Kindle edition is available and Amazon will ship hard copies around July 1st.

One Statistician’s View of Big Data

Recently I’ve had several questions about using machine learning models with large data sets. Here is a talk I gave at Yale’s Big Data Symposium on the subject.

Recent Changes to caret

Here is a summary of some recent changes to caret.

Projection Pursuit Classification Trees

I’ve been looking at this article for a new tree-based method. It uses other classification methods (e.g. LDA) to find a single variable to use in the split and builds a tree…

Feature Selection 2 - Genetic Boogaloo

Previously, I talked about genetic algorithms (GA) for feature selection and illustrated the algorithm using a modified version of the GA R package and simulated data. The…

Feature Selection Strikes Back (Part 1)

In the feature selection chapter, we describe several search procedures (“wrappers”) that can be used to optimize the number of predictors. Some techniques were described in…

Benchmarking Machine Learning Models Using Simulation

What is the objective of most data analysis? One way I think about it is that we are trying to discover or approximate what is really going on in our data (and in general…

What the Market Will Bear

I’m not sure what the third one is about, but save your money…

Reproducible Research at ENAR

I gave a talk at the Spring ENAR meetings this morning on some of the technical aspects of creating the book. The session was on reproducible research and the slides are here.

Confidence in Prediction

A few colleagues have just published a paper on measuring the confidence in prediction in regression models (“Interpretable, Probability-Based Confidence Metric for…

What Was Left Out

There were a few topics that just couldn’t be added to the book due to time and, especially, space. For a project like this, the old saying is “you’re never done, you just…