    Why we use IRT at DataCamp

    install.packages("mirt")
    install.packages("mirtCAT")
    install.packages("ggthemes")
    remotes::install_github("datacamp/tidymirt", upgrade = "never")
    
    library(tidyverse)
    library(mirt)
    library(mirtCAT)
    library(tidymirt)
    library(ggdc)
    library(patchwork)
    library(ggthemes)

    Introduction

    Item response theory (IRT) is a framework for putting items and people on the same scale. The core assumption is that an item has an intrinsic difficulty and a person an intrinsic ability to respond correctly to an item. This is formalized in the Rasch (or one-parameter logistic) model as follows:

    P(correct) = exp(θ - β) / (1 + exp(θ - β))

    where θ is the person's ability and β is the item's difficulty.

    This equation simply says that the probability of responding correctly to an item is determined by the difference between a person's ability and an item's difficulty. If they match, there is a 50% chance of a correct response. If they mismatch strongly, the probability approaches either 0 or 1, depending on the sign of the mismatch. The Rasch model is often visualised through an item characteristic curve.
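
    To make this concrete, here is a minimal numeric sketch using base R's plogis(), which is exactly the logistic function in the equation above:

    # probability of a correct response under the Rasch model: plogis(theta - beta)
    plogis(0 - 0)   # ability equals difficulty: 0.5
    plogis(3 - 0)   # ability far above difficulty: ~0.95
    plogis(-3 - 0)  # ability far below difficulty: ~0.05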

    Figure 1

    This figure shows the item characteristic curve for a single item with a difficulty level of 0. It relates the expected probability of a correct response to the true ability level of a person. By definition, when item difficulty coincides with ability, there is a 50% chance of a correct response. As ability rises above the item difficulty, the probability climbs towards 1; as it falls below, the probability drops towards 0.

    # create data set
    df <- tibble(beta = 0,
                 theta = seq(-4, 4, l = 1000))

    # use IRT model to calculate probability correct
    df$prob_correct <- exp(df$theta - df$beta) / (1 + exp(df$theta - df$beta))

    # visualise item characteristic curve
    df %>%
      ggplot(aes(x = theta, y = prob_correct)) +
      geom_line() +
      xlab("Ability") +
      ylab("Probability correct") +
      ggtitle("Item characteristic curve for a difficulty level of 0")

    Figure 2

    This figure shows a set of item characteristic curves for a test containing 7 items. This kind of chart provides some intuition about how a given ability level translates into an expected performance level on the test (see the short sketch after the plot). For example, at high ability levels the probability of responding correctly to each item is .75 or higher, implying a high chance that a person would get all items correct.

    # create data set
    df <- expand_grid(beta = seq(-3, 3, l = 7),
                      theta = seq(-4, 4, l = 1000))

    # use IRT model to calculate probability correct
    df$prob_correct <- exp(df$theta - df$beta) / (1 + exp(df$theta - df$beta))

    # visualise item characteristic curves
    df %>%
      ggplot(aes(x = theta, y = prob_correct, group = beta, color = as.factor(beta))) +
      geom_line() +
      xlab("Ability") +
      ylab("Probability correct") +
      scale_color_colorblind(name = "Item difficulty") +
      ggtitle("Item characteristic curves for a set of items")
    # total number of items and users to simulate from
    n_items <- 61
    n_users <- 1000

    # fix discrimination and guessing parameters to be compatible with the simdata function
    discr <- 1
    guess <- 0

    # put in a data frame; note that in mirt, d is an easiness intercept
    # (conventional difficulty is -d / a1), so d = -3 is the hardest item and d = 3 the easiest
    df_pars <- data.frame(a1 = rep(discr, n_items),
                          d = seq(-3, 3, l = n_items),
                          g = rep(guess, n_items))
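
    Because mirt works with intercepts rather than conventional difficulties, here is a small sketch (not part of the original analysis) confirming the difficulty range these parameters imply:

    # convert mirt intercepts to conventional difficulties (b = -d / a1)
    b <- -df_pars$d / df_pars$a1
    range(b)  # difficulties run from -3 (easiest item) to 3 (hardest item)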

    Figure 3

    For this figure, we performed a simulation inspired by the test visualised in Figure 2. That is, we expanded the full set to 61 items and formed two tests: one from the 30 easiest items and one from the 30 hardest items (the middle item is dropped). User ability was always set to 0, and data for 1000 users was simulated. The simulated data thus consists, for each user, of either a 1 (correct response) or a 0 (incorrect response) for each item that was part of the test.

    Figure 3 shows the distributions of the total number of correct responses for each test. It is immediately clear that the easy test on average yielded many more correct responses than the hard test. Keep in mind that each user has the same true ability level!

    # simulate data; Theta must be an N x 1 matrix of true abilities
    sim_data <- simdata(a = rep(discr, times = n_items),
                        d = matrix(seq(-3, 3, l = n_items)),
                        N = n_users,
                        itemtype = "dich",
                        Theta = matrix(rep(0, n_users)))

    # number of correct responses on the "hard" test (first 30 items: lowest
    # intercepts, so highest difficulty) and on the "easy" test (last 30 items)
    nc_hard <- apply(sim_data[, 1:floor(n_items / 2)], 1, sum)
    nc_easy <- apply(sim_data[, (ceiling(n_items / 2) + 1):n_items], 1, sum)
    
    # put in tibble
    df_nc <- tibble(nc = c(nc_hard, nc_easy), 
                    source = rep(c("hard", "easy"), each = n_users))
    
    # visualise the distribution of the number of correct responses for both tests;
    # dashed lines mark the mean of each test, coloured to match the densities
    df_nc %>%
      ggplot(aes(x = nc, color = source, fill = source)) +
      geom_density(alpha = .2) +
      geom_vline(xintercept = mean(nc_easy), linetype = "dashed", color = "black") +
      geom_vline(xintercept = mean(nc_hard), linetype = "dashed", color = "#E69F00") +
      scale_fill_colorblind(name = "Test difficulty") +
      scale_color_colorblind(name = "Test difficulty") +
      xlab("Number of correct responses") +
      ylab("Density") +
      ggtitle("Distribution of number of correct responses",
              subtitle = "Broken up by easy versus hard test")

    Figure 4

    In this figure, we then look at the distributions of estimated ability levels for the same easy and hard tests. The contrast with Figure 3 is immediately obvious: the distributions now largely overlap and are centered on the true ability level used to simulate the data. This is where IRT shines: accounting for item difficulty gives a better estimate of a person's ability.

    # generate mirt objects from the true parameters
    mod_hard <- mirtCAT::generate.mirt_object(parameters = df_pars[1:floor(n_items / 2), ],
                                              itemtype = "2PL")
    mod_easy <- mirtCAT::generate.mirt_object(parameters = df_pars[(ceiling(n_items / 2) + 1):n_items, ],
                                              itemtype = "2PL")

    # estimate ability scores from the response patterns on the easy and hard
    # tests, using expected a posteriori (EAP) estimation
    ability_hard <- fscores(mod_hard,
                            response.pattern = sim_data[, 1:floor(n_items / 2)],
                            method = "EAP")[, 1]
    ability_easy <- fscores(mod_easy,
                            response.pattern = sim_data[, (ceiling(n_items / 2) + 1):n_items],
                            method = "EAP")[, 1]
    
    # put together in a tibble
    df_estimate <- tibble(ability = c(ability_hard, ability_easy),
                          source = rep(c("hard", "easy"), each = n_users))
    
    # visualise the distribution of estimated ability scores; dashed lines mark
    # the mean of each test, coloured to match the densities
    df_estimate %>%
      ggplot(aes(x = ability, color = source, fill = source)) +
      geom_density(alpha = .2) +
      geom_vline(xintercept = mean(ability_easy), linetype = "dashed", color = "black") +
      geom_vline(xintercept = mean(ability_hard), linetype = "dashed", color = "#E69F00") +
      scale_fill_colorblind(name = "Test difficulty") +
      scale_color_colorblind(name = "Test difficulty") +
      xlab("Estimated ability") +
      ylab("Density") +
      ggtitle("Distribution of estimated abilities",
              subtitle = "Broken up by easy versus hard test")