Applied Predictive Modeling Blog

This is a continuation of the original blog that we made for our previous book. New posts, as of 2024, can be found here. The posts here are not associated with our employers; opinions are our own.

Other places where we create content: the Tidyverse blog and tidymodels.org’s Learn pages.



Progress Update (September 2024)

Three new chapters!
2024-09-23
Max Kuhn

Progress Update (June 2024)

We just released a new set of chapters:
2024-06-18
Max Kuhn

Data Usage with Postprocessing

This document is used to discuss and test ideas for how we can estimate and evaluate machine learning models that have three potential components:
2024-05-07
Max Kuhn, Simon Couch

Post Hoc Nearest Neighbors Prediction Adjustments

Quinlan (1993) describes a post-processing technique used for numeric predictions that adjusts them using information from the training set.
2024-04-10
Max Kuhn

February 2024 Talks

One keynote and one tutorial last month (both new).
2024-04-09
Max Kuhn

Predictive Survival Analysis

Predictive survival models come to tidymodels.
2024-04-07
Max Kuhn

Two New Preprocessing Chapters

We just released two new chapters: “Transforming Numeric Predictors” and “Working with Categorical Predictors.”
2024-03-18
Max Kuhn

2024 Tidymodels User Survey

Tell us which features are most important to you.
2024-03-04
Max Kuhn

WTF Article

Kjell and I have a new paper called “What They Forgot to Tell You about Machine Learning with an Application to Pharmaceutical Manufacturing.”
2024-03-01
Max Kuhn

Progress Update (February 2024)

Since the last update on 2023-11-20, we have a few new sections and chapters.
2024-02-27
Max Kuhn

New Location, Same Content

This is the new home for the Applied Predictive Modeling blog.
2024-02-26
Max Kuhn

2022 tidymodels user survey

We are conducting another survey to see where users would like us to spend our development time.
2021-10-07
Max Kuhn

tidymodels updates and voting!

While I’m still supporting caret, the majority of my development effort has gone into the tidyverse modeling packages (called tidymodels).
2020-04-27
Max Kuhn

Slides from R/Pharma

My slides from the R/Pharma conference on “Modeling in the Tidyverse” are in pdf format as well as the HTML version.
2018-08-16
Max Kuhn

R/Medicine conference

I’ll be giving a talk at the R/Medicine conference on Sept 7th in New Haven CT.
2018-08-15
Max Kuhn

Podcast on Nonclinical Statistics

Hugo Bowne-Anderson and I spoke about data science in pharmaceuticals, the tidyverse, and more for the excellent DataFramed podcast from DataCamp. Listen to it here or…
2018-06-30
Max Kuhn

Early draft of our “Feature Engineering and Selection” book

Kjell and I are writing another book on predictive modeling, this time focused on all the things that you can do with predictors. It’s about 60% done and we’d love to get…
2018-05-14
Max Kuhn

tidyposterior slides

tidyposterior is an R package for comparing models based on their resampling statistics. There are a few case studies on the webpage to illustrate the process.
2018-05-04
Max Kuhn

New Workshop in Washington DC (August)

I’ll be conducting a workshop called “Applied Machine Learning” in Washington DC on August 15 and 16. The last one, at the RStudio conference, sold out quickly.
2018-04-10
Max Kuhn

Tidy Resampling Redux with Agricultural Economics Data

(No statistical graphs in this one. This is what my dog Artemis looks like when she wants my attention during work hours.)
2018-03-12
Max Kuhn

RStudio 2018 Conference Presentation and Materials

We’ve released our videos of the talks at the 2018 RStudio conference. My talk was Modeling in the Tidyverse (video) and I was also in the Tidyverse fireside chat (video).…
2018-03-04
Max Kuhn

While you wait for that to finish, can I interest you in parallel processing?

caret has been able to utilize parallel processing for some time (before it was on CRAN in October 2007) using slightly different versions of the package. Around September…
2018-01-17
Max Kuhn

Lots of Package News

I’ve sent a slew of packages to CRAN recently (thanks to Swetlana and Uwe). There are updates to:
2017-12-11
Max Kuhn

caret Cheatsheet

It can be found on the RStudio cheatsheet page. Suggestions and pull requests are always welcome.
2017-09-12
Max Kuhn

Nested Resampling with rsample

A typical scheme for splitting the data when developing a predictive model is to create an initial split of the data into a training and test set. If resampling is used, it…
2017-09-04
Max Kuhn

 

Nonclinical Statistics Position in New England

I try to limit postings about jobs here, but there is an interesting position in pharma for a statistician in New England.
2017-07-27
Max Kuhn

Do Resampling Estimates Have Low Correlation to the Truth? The Answer May Shock You.

One criticism that is often leveled against using resampling methods (such as cross-validation) to measure model performance is that there is no correlation between the CV…
2017-04-24
Max Kuhn

 

caret package plans

A few people have asked if anything is going to happen to caret now that I’m working at RStudio.
2017-02-02
Max Kuhn

Working at RStudio

I’ve joined Hadley’s team at RStudio.
2016-11-28
Max Kuhn

2016 UK Tour

I’ll be in the UK next week doing three talks in three days:
2016-09-26
Max Kuhn

 

DataCamp Course [UPDATE]

Zachary Deane-Mayer, who collaborates on caret, has put together a DataCamp course on Machine Learning in R.
2016-09-26
Max Kuhn

Boston R User Group Talk [UPDATE]

I’ll be giving a talk at the Boston R User Group on Thursday March 10th at 6:00 PM. The talk will be on rule-based regression models.
2016-03-04
Max Kuhn

Nonclinical Statistics Book

Springer has a new book (Amazon) edited by Lanju Zhang that captures the breadth of problems for statistics in the pharmaceutical industry including: compound optimization…
2016-02-08
Max Kuhn

Central Iowa R User Group Talk [Updated]

I’ll be giving a talk (“Applied Predictive Modeling”) to the Central Iowa R User Group on Thursday night at 6:00 PM to 8:00 PM (CST).
2016-01-18
Max Kuhn

Nonclinical Statistician Position at Pfizer

The Research Statistics group collaborates across a wide variety of activities in the early phases of drug discovery. This position is located in Groton CT and has a focus…
2015-12-14
Max Kuhn

In Search Of…

Rafael Ladeira asked on github:
2015-12-13
Max Kuhn

C5.0 Class Probability Shrinkage

(The image above has nothing to do with this post. It does, however, show the prize that my son won during a recent vacation to Virginia and how I got it back home).
2015-09-14
Max Kuhn

The 2014 Ziegel Award

Last night in Seattle, Kjell and I were awarded the American Statistical Association’s Ziegel Award for the best book reviewed in Technometrics during 2014. Technometrics is…
2015-08-12
Max Kuhn

Feature Engineering versus Feature Extraction: Game On!

“Feature engineering” is a fancy term for making sure that your predictors are encoded in the model in a manner that makes it as easy as possible for the model to achieve…
2015-08-03
Max Kuhn

 

New caret Version (6.0-52)

A new version of caret (6.0-52) is on CRAN.
2015-07-22
Max Kuhn

Slides from recent talks

I’ve been buried in work lately but thought I’d share the slides from two recent talks. The first is from the Bay Area RUG. Since someone filmed the talks, I was waiting to…
2015-04-21
Max Kuhn

A Talk and Course in NYC Next Week

I’ll be giving a talk on Tuesday February 17 (7:00PM-9:00PM) that will be an overview of predictive modeling. It will not be highly technical and here is the current outline:
2015-03-13
Max Kuhn

Simulated Annealing Feature Selection

As previously mentioned, caret has two new feature selection routines based on genetic algorithms (GA) and simulated annealing (SA). The help pages for the two new functions…
2015-01-12
Max Kuhn

Regression Solutions Available

The github page for the APM exercises has been updated with three new files for Chapters 6-8 (the section on regression).
2015-01-08
Max Kuhn

New Version of caret on CRAN

A new version of caret is on CRAN.
2015-01-05
Max Kuhn

 

My Research Tools

Pfizer has an excellent group of librarians and they recently contacted people, including a few statisticians, about how we find and organize articles. I’ve spent…
2014-12-15
Max Kuhn

Comparing the Bootstrap and Cross-Validation

This is the second of two posts about the performance characteristics of resampling methods. The first post focused on the cross-validation techniques and this post mostly…
2014-12-08
Max Kuhn

Comparing Different Species of Cross-Validation

This is the first of two posts about the performance characteristics of resampling methods. I just had major shoulder surgery, but I’ve pre-seeded a few blog posts. More…
2014-12-02
Max Kuhn

 

Solutions on github

See this page. We’re not done with them all but Chapters 3 and 4 are there and the regression chapters are not too far behind.
2014-11-12
Max Kuhn

 

Some Thoughts on “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?”

Sorry for the blogging break. I’ve got a few planned for the next few weeks based on some work I’ve been doing.
2014-11-11
Max Kuhn

 

Exercise Solutions

I’m finally recovering from the summer and will start posting again soon.
2014-10-01
Max Kuhn

 

useR! 2014 Highlights

(This article was originally posted at http://appliedpredictivemodeling.com)
2014-07-03
Max Kuhn

New caret version with adaptive resampling

A new version of caret is on CRAN now.
2014-05-28
Max Kuhn

 

A Tutorial and Talk at useR! 2014 [Important Update]

See the update below
2014-05-07
Max Kuhn

 

Cross-validation pitfalls when selecting and assessing regression and classification models

Damjan Krstajic and friends have a great paper on pitfalls of cross-validation. Although the paper uses chemistry data, the meat of the article is broadly applicable. It…
2014-04-10
Max Kuhn

 

Bay Area RUG Talk on 3/17 (updated)

I’m making my yearly pilgrimage to San Francisco to teach at PAW.
2014-03-09
Max Kuhn

 

caret webinar materials

The webinar was recorded (thanks to Ray DiGiacomo and the Orange County RUG). The slides are here minus a few typos.
2014-02-28
Max Kuhn

Sample Mislabeling and Boosted Trees

A while back, I saw this post on StackExchange/Crossvalidated: “Does anyone know how well C5.0 boosting performs in the presence of mislabeled data?” I did some simulations…
2014-02-18
Max Kuhn

Optimizing Probability Thresholds for Class Imbalances

One of the toughest problems in predictive modeling occurs when the classes have a severe imbalance. We spend an entire chapter on this subject itself. One consequence of this…
2014-02-06
Max Kuhn

 

caret webinar on Feb 25

I’ll be doing a webinar with the Orange County R User Group on the caret package on Tue, Feb 25, 2014 1:00 PM - 2:00 PM EST.
2014-02-02
Max Kuhn

Calibration Affirmation

In the book, we discuss the notion of a probability model being “well calibrated”. There are many different mathematical techniques that classification models use to produce…
2014-01-04
Max Kuhn

Down-Sampling Using Random Forests

We discuss dealing with large class imbalances in Chapter 16. One approach is to sample the training set to coerce a more balanced class distribution. We discuss
2013-12-08
Max Kuhn

 

ASA Talk in New York This Thursday (11/14)

I’ll be giving a talk on predictive modeling for the American Statistical Association next Thursday (the 14th):
2013-11-09
Max Kuhn

The Basics of Encoding Categorical Data for Predictive Models

Thomas Yokota asked a very straight-forward question about encodings for categorical predictors: “Is it bad to feed it non-numerical data such as factors?” As usual, I will…
2013-10-23
Max Kuhn

 

Exercises and Solutions

Kjell and I are putting together solutions to the exercise sections. It is a lot of work, so it may take some time.
2013-08-30
Max Kuhn

 

Availability

After being on backorder for about 10 weeks, we are told that a larger batch of books have been printed and should be available shortly (despite the Amazon page’s note about…
2013-08-16
Max Kuhn

Equivocal Zones

In Chapter 11, equivocal zones were briefly discussed. The idea is that some classification errors are close to the probability boundary (i.e. 50% for two class outcomes).…
2013-08-16
Max Kuhn

 

UseR! Slides for “Classification Using C5.0”

I’ve had a lot of requests, so here they are. Hopefully, all of the slides will be posted on the conference website.
2013-07-17
Max Kuhn

 

UseR! 2013 Highlights

The conference was excellent this year. My highlights:
2013-07-13
Max Kuhn

Measuring Associations

In Chapter 18, we discuss a relatively new method for measuring predictor importance called the maximal information coefficient (MIC). The original paper is by Reshef et al…
2013-06-21
Max Kuhn

type = “what”?

One great thing about R is that it has a wide diversity of packages written by many different people with many different viewpoints on how software should be designed. However…
2013-06-13
Max Kuhn

Feature Selection 3 - Swarm Mentality

“Bees don’t swarm in a mango grove for nothing. Where can you see a wisp of smoke without a fire?” - Hla Stavhana
2013-06-06
Max Kuhn

 

33 Months Later

After starting in September 2010, the book is now available from the publisher. The Kindle edition is available and Amazon will ship hard copies around July 1st.
2013-05-29
Max Kuhn

 

One Statistician’s View of Big Data

Recently I’ve had several questions about using machine learning models with large data sets. Here is a talk I gave at Yale’s Big Data Symposium on the subject.
2013-05-20
Max Kuhn

 

Recent Changes to caret

Here is a summary of some recent changes to caret.
2013-05-18
Max Kuhn

 

Projection Pursuit Classification Trees

I’ve been looking at this article for a new tree-based method. It uses other classification methods (e.g. LDA) to find a single variable to use in the split and builds a tree…
2013-05-14
Max Kuhn

Feature Selection 2 - Genetic Boogaloo

Previously, I talked about genetic algorithms (GA) for feature selection and illustrated the algorithm using a modified version of the GA R package and simulated data. The…
2013-05-08
Max Kuhn

Feature Selection Strikes Back (Part 1)

In the feature selection chapter, we describe several search procedures (“wrappers”) that can be used to optimize the number of predictors. Some techniques were described in…
2013-04-29
Max Kuhn

Benchmarking Machine Learning Models Using Simulation

What is the objective of most data analysis? One way I think about it is that we are trying to discover or approximate what is really going on in our data (and in general…
2013-04-13
Max Kuhn

What the Market Will Bear

I’m not sure what the third one is about, but save your money…
2013-03-29
Max Kuhn

 

Reproducible Research at ENAR

I gave a talk at the Spring ENAR meetings this morning on some of the technical aspects of creating the book. The session was on reproducible research and the slides are here.
2013-03-11
Max Kuhn

 

Confidence in Prediction

A few colleagues have just published a paper on measuring the confidence in prediction in regression models (“Interpretable, Probability-Based Confidence Metric for…
2013-02-12
Max Kuhn

 

What Was Left Out

There were a few topics that just couldn’t be added to the book due to time and, especially, space. For a project like this, the old saying is “you’re never done, you just…
2013-02-10
Max Kuhn