DirtyCred Remastered: how to turn a UAF into Privilege Escalation

Background

A few weeks ago, @kiks and I started looking for recent CVEs in order to practice our kernel exploitation skills. We chose CVE-2022-2602 as our target for two reasons:

In the end, we were able to create a functional exploit using two different techniques: userfaultfd and inode locking. FUSE exploit coming soon, I’ll update this blog post :)

Go check out @kiks’ blogpost about the same vulnerability here :)

I’ll start this blogpost by explaining some concepts that are useful in order to understand the vuln.

File Descriptors

Briefly, a file descriptor is just a number, used by processes in user-space to refer to open resources.

It could be a file, a UNIX socket, a network socket, a userfaultfd handler, a message queue…ANYTHING!

Each process starts with three file descriptors: 0 (stdin), 1 (stdout) and 2 (stderr).

They represent the three standard communication channels of a process. The number is an index into the process’s file descriptor table, whose entries point into the open file table of the OS, which keeps track of all resources opened by processes on the system.

Openfiletable

More info here
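To make the “just a number” point concrete, here is a minimal sketch. The helper name `reopen_gets_same_fd` is mine (it is not part of any API), and the only thing it relies on is the POSIX rule that open() returns the lowest-numbered descriptor that is not currently in use.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: open /dev/null, close it, then open it again.
 * Returns 1 if both calls handed out the same number, 0 if not, -1 on error. */
int reopen_gets_same_fd(void)
{
	int first = open("/dev/null", O_RDONLY);
	if (first < 0)
		return -1;
	close(first);	/* the slot in the fd table is free again */

	int second = open("/dev/null", O_RDONLY);
	if (second < 0)
		return -1;

	printf("first=%d second=%d\n", first, second);
	close(second);
	return first == second;
}
```

The number itself carries no meaning: once a slot in the table is released, the next open() simply reuses it.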

Reference Counting

Each open file (the kernel object a file descriptor points to) has a reference count associated with it.

The OS uses it to detect when an open file is no longer referenced, so that the memory associated with it can be freed.

For example, a call to dup2() increments the reference count of the file behind a descriptor, while a close() decrements it.

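#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv){
	const char msg[] = "Hello from fd 99!\n";

	printf("Right now I'm using fd 1 to print this out!\n");
	// Flush now: once fd 1 is closed, buffered output would be lost
	fflush(stdout);
	dup2(1,99);
	// Now stdout has two references
	// Let's close fd 1 --> stdout's ref is decremented
	close(1);
	// Stdout is still alive because ref = 1
	// Write exactly the message length (without the trailing NUL)
	write(99, msg, sizeof(msg) - 1);
	exit(0);
}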

Things get more complicated when a file descriptor is sent from one process to another. I won’t go into details here: Google Project Zero has a great blogpost about the Linux garbage collector, which I consider a must read :) It also explains how reference counting cycles are handled, which is important for understanding the root cause of the vuln.
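For reference, this is roughly what sending a file descriptor looks like from user-space. It is a minimal sketch using SCM_RIGHTS over a socketpair inside a single process; the helper names (`send_fd`, `recv_fd`, `fd_survives_transfer`) are hypothetical and not taken from our exploit. The interesting part is that the original descriptor is closed while the message is still in flight, and the file stays alive because the queued message holds its own reference.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one fd over a UNIX socket as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg = { 0 };

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
	c->cmsg_level = SOL_SOCKET;
	c->cmsg_type = SCM_RIGHTS;
	c->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(c), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd; the kernel installs it under a new number. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg = { 0 };
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);
	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
	if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(c), sizeof(int));
	return fd;
}

/* Returns 1 if a pipe's write end still works after being sent,
 * closed locally, and received again; -1 on setup errors. */
int fd_survives_transfer(void)
{
	int sv[2], p[2];
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0 || pipe(p) < 0)
		return -1;
	if (send_fd(sv[0], p[1]) < 0)
		return -1;
	close(p[1]);	/* the in-flight message still references the file */

	int got = recv_fd(sv[1]);
	if (got < 0)
		return -1;

	int ok = write(got, "A", 1) == 1 && read(p[0], &c, 1) == 1 && c == 'A';
	close(got); close(p[0]); close(sv[0]); close(sv[1]);
	return ok;
}
```

While the message sits in the socket’s receive queue, the file is referenced only by that queue. This “in-flight” state is what the garbage collector has to reason about.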

open: follow the white rabbit into the kernel

Let’s talk a little bit more about file fds.

What happens in the kernel when an open() system call is issued? Let’s take a look at the kernel source code:

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
	if (force_o_largefile())
		flags |= O_LARGEFILE;
	return do_sys_open(AT_FDCWD, filename, flags, mode);
}

open indirectly calls do_sys_openat2:

static long do_sys_openat2(int dfd, const char __user *filename,
			   struct open_how *how)
{
	struct open_flags op;
	int fd = build_open_flags(how, &op);
	struct filename *tmp;

	if (fd)
		return fd;

	tmp = getname(filename);
	if (IS_ERR(tmp))
		return PTR_ERR(tmp);

	fd = get_unused_fd_flags(how->flags);
	if (fd >= 0) {
		struct file *f = do_filp_open(dfd, tmp, &op);
		if (IS_ERR(f)) {
			put_unused_fd(fd);
			fd = PTR_ERR(f);
		} else {
			fsnotify_open(f);
			fd_install(fd, f);
		}
	}
	putname(tmp);
	return fd;
}

The do_filp_open function calls path_openat, which eventually reaches __alloc_file, where kmem_cache_zalloc() allocates an empty struct file:

static struct file *__alloc_file(int flags, const struct cred *cred)
{
	struct file *f;
	int error;

	f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
	if (unlikely(!f))
		return ERR_PTR(-ENOMEM);

	f->f_cred = get_cred(cred);
	error = security_file_alloc(f);
	if (unlikely(error)) {
		file_free_rcu(&f->f_u.fu_rcuhead);
		return ERR_PTR(error);
	}

	atomic_long_set(&f->f_count, 1);
	rwlock_init(&f->f_owner.lock);
	spin_lock_init(&f->f_lock);
	mutex_init(&f->f_pos_lock);
	f->f_flags = flags;
	f->f_mode = OPEN_FMODE(flags);
	/* f->f_version: 0 */

	return f;
}

struct file

kmem_cache_zalloc returns a pointer to a zeroed struct file. This is the low-level representation of an opened file:

struct file {
	union {
		struct llist_node	fu_llist;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;

	/*
	 * Protects f_ep, f_flags.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
	enum rw_hint		f_write_hint;
	atomic_long_t		f_count;
	unsigned int 		f_flags;
	fmode_t			f_mode;
	struct mutex		f_pos_lock;
	loff_t			f_pos;
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;

	u64			f_version;
#ifdef CONFIG_SECURITY
	void			*f_security;
#endif
	/* needed for tty driver, and maybe others */
	void			*private_data;

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct hlist_head	*f_ep;
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;
	errseq_t		f_wb_err;
	errseq_t		f_sb_err; /* for syncfs */
} __randomize_layout
  __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */

This struct is allocated in a so-called cache, in this case the filp cache.

In case you don’t know what the hell I’m talking about, don’t worry, I got you covered. This video explains the kernel memory allocator in great detail. If you speak Italian, I suggest watching these videos about the kernel memory allocator.

io_uring

io_uring is a Linux subsystem for asynchronous I/O.

IORING

If you want to know more about io_uring, I recommend reading @chompie’s blogpost.
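To give an idea of the two io_uring primitives that matter later (creating a ring and registering a file with it), here is a minimal sketch using raw syscalls. The function name `ring_with_registered_file` is hypothetical, /dev/null is just a placeholder target, and I’m assuming kernel headers that ship linux/io_uring.h. On systems where io_uring is disabled or filtered, it simply returns a negative errno.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 0 on success, or a negative errno
 * (e.g. -ENOSYS or -EPERM where io_uring is unavailable). */
int ring_with_registered_file(void)
{
	struct io_uring_params p;
	memset(&p, 0, sizeof(p));	/* params must be zeroed before setup */

	/* Create a ring with 8 submission queue entries. */
	int ring_fd = (int)syscall(__NR_io_uring_setup, 8, &p);
	if (ring_fd < 0)
		return -errno;

	int file_fd = open("/dev/null", O_WRONLY);
	if (file_fd < 0) {
		int e = errno;
		close(ring_fd);
		return -e;
	}

	/* Register the file: requests can then refer to it by its index (0)
	 * in this array instead of by its descriptor number. */
	__s32 fds[1] = { file_fd };
	int ret = (int)syscall(__NR_io_uring_register, ring_fd,
			       IORING_REGISTER_FILES, fds, 1);
	int err = ret < 0 ? -errno : 0;

	close(file_fd);
	close(ring_fd);
	return err;
}
```

A registered file is referenced by the ring itself, so its lifetime is tied to the io_uring instance and not only to the descriptor that was used to register it.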

Root Cause Analysis

The first thing we did was to try the public PoC on a vulnerable kernel with KASAN enabled.

[   35.652903] ==================================================================
[   35.655540] BUG: KASAN: use-after-free in __io_queue_sqe+0x20f/0x4d0
[   35.655540] Read of size 8 at addr ffff8880086c8528 by task iou-sqp-147/149

The KASAN report confirms that we are dealing with a use-after-free vulnerability in the io_uring subsystem.

[   35.655540] The buggy address belongs to the object at ffff8880086c8500
[   35.655540]  which belongs to the cache filp of size 232
[   35.655540] The buggy address is located 40 bytes inside of
[   35.655540]  232-byte region [ffff8880086c8500, ffff8880086c85e8)
[   35.655540] The buggy address belongs to the page:
[   35.655540] page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x86c8
[   35.655540] flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
[   35.655540] raw: 000fffffc0000200 0000000000000000 dead000000000122 ffff888005614b40
[   35.655540] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[   35.655540] page dumped because: kasan: bad access detected

It also tells us something about the address: it’s located 40 bytes inside a 232-byte object from the filp cache, so it involves a struct file.
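As a small sanity check on the report, the offset can be recomputed from the two addresses and compared against the struct file layout shown earlier. The sketch below uses a mock of the first few fields only (`mock_file_head` is my own stand-in, not a kernel type) and assumes a 64-bit build without structure layout randomization. Under those assumptions, offset 40 would land on f_op.

```c
#include <stddef.h>

/* Stand-ins with the same sizes as the kernel types on a 64-bit build. */
struct mock_path { void *mnt; void *dentry; };

struct mock_file_head {
	union {
		struct { void *next; } fu_llist;			/* llist_node */
		struct { void *next; void (*func)(void *); } fu_rcuhead;	/* rcu_head */
	} f_u;				/* bytes  0..15 */
	struct mock_path f_path;	/* bytes 16..31 */
	void *f_inode;			/* bytes 32..39 */
	const void *f_op;		/* bytes 40..47 */
};

/* Faulting address minus object base, both taken from the KASAN report. */
size_t faulting_offset(void)
{
	return (size_t)(0xffff8880086c8528ULL - 0xffff8880086c8500ULL);
}
```

With randomized layouts the field at that offset can differ, so treat this as a hint about what the stale read touches and not as proof.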

After analyzing the public PoC, we understood the root cause of the vulnerability:

The problem arises when a kernel thread that is completing an OP_WRITEV (IORING_OP_WRITEV) request in io_uring on a registered file is somehow paused, and that file gets freed from userland in the meantime.

The free is achieved by abusing the behaviour of the unix_gc garbage collector, which searches for all socket buffers that are part of an unbreakable cycle and kfree()s them. By inserting the io_uring file descriptor into an unbreakable cycle, it’s possible to trigger the kfree() on it.

There are different ways to pause a kernel thread. In the following sections I’ll describe some of them.

Exploitation plan - userfaultfd

One possibility is using userfaultfd: we issue an OP_WRITEV request whose iovec lives in a demand-zero page, i.e. memory that has been mapped but never touched. In this situation, the first access to that memory causes a page fault.

But in the io_uring context, when is the page fault triggered? Let’s look at the implementation of io_write, which is the OP_WRITEV handler:

static int io_write(struct io_kiocb *req, unsigned int issue_flags)
{
	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
	struct kiocb *kiocb = &req->rw.kiocb;
	struct iov_iter __iter, *iter = &__iter;
	struct io_async_rw *rw = req->async_data;
	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
	struct iov_iter_state __state, *state;
	ssize_t ret, ret2;

	[Truncated]
	
	ret = io_import_iovec(WRITE, req, &iovec, iter, !force_nonblock); //It will trigger the page fault

	[Truncated]
}

io_import_iovec indirectly calls copy_from_user(), which imports the iovec from user-space. If the iovec sits in a demand-zero page, copy_from_user() triggers the page fault: the kernel thread that is completing the write request blocks inside copy_from_user(), waiting for userland to finish handling the fault. Meanwhile, the userland page fault handler can invoke the unix_gc garbage collector, which frees the io_uring file descriptor; this in turn triggers the kfree() of the struct file belonging to the file we are writing to.

When the kernel thread resumes its execution, it will write the content of the iovec inside a file, described by a freed struct file.

If we manage to replace the freed struct file with another one, the io_write will write the content to the replaced file.

Turning it into privilege escalation

Since the goal is to achieve privilege escalation, my first idea was to replace the file with /etc/passwd, in order to write a new entry into it. Yes, it’s possible to write to /etc/passwd, because the read/write check is performed before the page fault handler is triggered!

If we open a file in O_RDWR mode and then substitute it with /etc/passwd, it’s possible to write a new entry inside the passwd file.

So the exploit workflow is:

  1. Open an io_uring file descriptor.
  2. Add a registered file to it.
  3. Insert io_uring fd inside an unbreakable cycle.
  4. Issue an io_write request, which will be paused using the page fault handler.
  5. Trigger the kfree() of the struct file of the registered file (triggering unix_gc garbage collector).
  6. Replace the freed file with /etc/passwd.
  7. Resume write operation.
  8. Root!

This is what happens inside the filp cache:

  1. Allocation of a struct file for a readable/writeable file:

Step1

  2. Triggering the free of the struct file:

Step2

  3. Replacing the freed struct file with /etc/passwd:

Step3

If we win the race condition, the content of the iovec is written inside /etc/passwd.
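The logic of the three steps above can be modelled entirely in user-space. The sketch below is a toy: `mock_file`, `toy_write` and the single static `slot` are made-up stand-ins for struct file, io_write and a filp cache object, and no real file is touched. It only shows why a permission check done before the stall no longer protects the write done after it.

```c
#include <stdio.h>
#include <string.h>

struct mock_file {
	const char *path;
	int writable;
	char data[64];
};

static struct mock_file slot;	/* one object of a pretend "filp" cache */

/* Stand-in for the stalled window: the object is freed and its memory
 * is reused for a different, read-only file. */
static void free_and_reuse_slot(void)
{
	slot = (struct mock_file){ .path = "/etc/passwd", .writable = 0 };
}

/* Toy write path: check permission, stall, then write through the
 * same pointer without checking again. */
static int toy_write(struct mock_file *f, const char *payload,
		     void (*stall)(void))
{
	if (!f->writable)	/* time of check */
		return -1;
	if (stall)
		stall();	/* the thread is paused here */
	snprintf(f->data, sizeof(f->data), "%s", payload);	/* time of use */
	return 0;
}

/* Returns 1 if the payload ended up in the read-only replacement object. */
int toy_race(void)
{
	slot = (struct mock_file){ .path = "/tmp/scratch", .writable = 1 };
	int ret = toy_write(&slot, "new entry", free_and_reuse_slot);

	return ret == 0 &&
	       strcmp(slot.path, "/etc/passwd") == 0 &&
	       strcmp(slot.data, "new entry") == 0 &&
	       !slot.writable;
}
```

In the toy the swap always happens inside the window. In the real exploit this is the race that has to be won.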

PoC, or it didn’t happen

The PoC adds a new entry inside /etc/passwd for a new root user called pwned with password lol.

PoC1

Exploitation plan - inode locking

Sadly, userfaultfd is no longer available to unprivileged users on almost all Linux distributions, so this approach won’t work there :(

Fortunately, there are other techniques that allow us to pause a kernel thread. One of them is abusing the inode locking mechanism.

What is inode locking?

Inode locking is a mechanism used by filesystems, in our case the ext4 filesystem, to ensure that only one process at a time can write to a file.

This is done inside the write implementation of the filesystem:

static ssize_t ext4_buffered_write_iter(struct kiocb *iocb, struct iov_iter *from){
	ssize_t ret;
	struct inode *inode = file_inode(iocb->ki_filp);
	inode_lock(inode);

	[Truncated]

	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
	
	[Truncated]

	inode_unlock(inode);
	return ret;
}

Before performing the write operation, the kernel thread tries to grab the lock on the inode.

More info here

It is possible to exploit that by creating another thread that keeps writing to the target file, and then issuing the OP_WRITEV request. In this situation, the kernel thread that is performing the OP_WRITEV request is blocked until the other thread has completed the write operation.

We can use this mechanism as a replacement for the userfaultfd technique. The rest of the exploit doesn’t change.

PoC, or it didn’t happen pt. 2

PoC2

Exploitation plan - FUSE filesystem

COMING SOON…

Conclusion

I had a lot of fun understanding how to exploit this vuln, and I learned a lot about Linux internals :)

Huge thanks to @kiks. I had a lot of fun working with him on this project, he is a very skilled guy :)

Check out his blogpost on the same vulnerability: it goes into much more detail on the issues we had during the exploit development phase, and it also introduces the new KRWX feature we used during exploit development.

I’ve uploaded the PoCs here