Conditional PDF Naming in R: A Step-by-Step Guide to Saving Files Based on IDs

Data analysts and researchers often need to download many PDFs from a website for later analysis. This article shows how to name each PDF after its record ID when saving the files in a loop in R, and how to skip files that have already been downloaded.

Background: Working with PDFs in R

R has several packages for working with PDF files, such as pdftools for extracting text and metadata. Downloading and saving PDFs from a website, however, needs only base R: download.file() fetches the file, and file.exists() lets you skip files that are already on disk. This article shows how to name the saved PDFs after their IDs.

Problem Statement

The problem is that a simple download loop usually names each file after the last part of its URL (via basename()). Those names are whatever the uploader chose, so with a large dataset you have to keep track of which file name belongs to which ID.

For example, let’s consider our head_data dataset:

url_pdf                                                                                           id       
  <chr>                                                                                             <chr>    
1 https://projekter.aau.dk/projekter/files/415527824/Speciale_2021.pdf                              413008245
2 https://projekter.aau.dk/projekter/files/415526224/FARDIG_SPECIALE_2_0__PDF.pdf                   413011720
3 https://projekter.aau.dk/projekter/files/437213254/Den_Almene_Udfordring.pdf                      432811291
4 https://projekter.aau.dk/projekter/files/415460040/Speciale_sociologi_NannaDyhr_MaleneScholer.pdf 413009078
5 https://projekter.aau.dk/projekter/files/420992851/Katrine_B._Dethlefsen__speciale_F2021.pdf      413050128
6 https://projekter.aau.dk/projekter/files/407804447/Speciale_2021_Katrine_May_Duus.pdf             405494084

Using the original code:

for (url in head_data$url_pdf) {
  download.file(url, destfile = basename(url), mode = "wb")
}

We would end up with filenames like Speciale_2021.pdf or FARDIG_SPECIALE_2_0__PDF.pdf, which say nothing about the ID each document belongs to.
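
As a quick check of what basename() produces for the first URL in the table, here is a minimal sketch (the variable name original_name is only illustrative):

```r
# First URL from head_data
url <- "https://projekter.aau.dk/projekter/files/415527824/Speciale_2021.pdf"

# basename() keeps only the part after the last "/"
original_name <- basename(url)
print(original_name)
# [1] "Speciale_2021.pdf"
```

The result carries no trace of the id column, which is why the file name has to be built from the ID instead.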

Solution: Conditional PDF Naming

To conditionally name the saved PDFs, we need to create a filename that corresponds to the ID of each document. We can achieve this using the paste0() function in combination with the head_data dataset’s $id column.

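Here is the core of the loop body. It assumes i is the current row index and url is the URL in that row; both are defined in the full example further down: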

# Create filename, including .pdf and subfolder (pdfs/)
filename <- paste0("pdfs/", head_data[i,]$id, ".pdf")

# Check if file already exists, if not download and save
if (!file.exists(filename)) {
  download.file(url, destfile = filename, mode = "wb")
}

In this code:

  1. We create a filename variable with paste0(), combining the subfolder name "pdfs/", the ID in row i of head_data (head_data[i,]$id), and the .pdf extension.
  2. We check if the file already exists at the specified location using file.exists(). If it does not exist, we download and save the PDF.
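
Because paste0() and file.exists() are both vectorized, the same two steps can also be applied to all IDs at once. The sketch below is only an illustration: it uses three IDs from the example table and a temporary folder instead of pdfs/, and simulates one earlier download with file.create(). The names ids, out_dir, filenames and to_download are assumptions, not part of the original code.

```r
# Three IDs taken from the example table
ids <- c("413008245", "413011720", "432811291")

# Assumption: a temporary folder stands in for "pdfs/"
out_dir <- file.path(tempdir(), "pdfs_sketch")
unlink(out_dir, recursive = TRUE)              # start from a clean folder
dir.create(out_dir, recursive = TRUE)

# paste0() is vectorized: one target path per ID
filenames <- file.path(out_dir, paste0(ids, ".pdf"))

# Pretend the first file was already downloaded on an earlier run
file.create(filenames[1])

# file.exists() is vectorized too; TRUE marks files that still need downloading
to_download <- !file.exists(filenames)
print(basename(filenames[to_download]))
```

This makes it easy to see, before any download starts, which IDs are still missing.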

Example Use Case

Let’s use our example dataset to demonstrate how this code works:

head_data <- structure(list(url_pdf = c("https://projekter.aau.dk/projekter/files/415527824/Speciale_2021.pdf",
                                       "https://projekter.aau.dk/projekter/files/415526224/FARDIG_SPECIALE_2_0__PDF.pdf",
                                       "https://projekter.aau.dk/projekter/files/437213254/Den_Almene_Udfordring.pdf",
                                       "https://projekter.aau.dk/projekter/files/415460040/Speciale_sociologi_NannaDyhr_MaleneScholer.pdf",
                                       "https://projekter.aau.dk/projekter/files/420992851/Katrine_B._Dethlefsen__speciale_F2021.pdf",
                                       "https://projekter.aau.dk/projekter/files/407804447/Speciale_2021_Katrine_May_Duus.pdf"),
                  id = c("413008245", "413011720", "432811291", "413009078", 
                         "413050128", "405494084")), row.names = c(NA, -6L), class = c("tbl_df", 
                                                                                      "tbl", "data.frame"))

# Create subfolder 'pdfs' if it does not exist
if (!dir.exists("pdfs")) dir.create("pdfs")

for (i in seq_len(nrow(head_data))) {
  # Retrieve URL from head_data
  url <- head_data[i,]$url_pdf
  
  # Create filename, including .pdf and subfolder (pdfs/)
  filename <- paste0("pdfs/", head_data[i,]$id, ".pdf")
  
  # Check if file already exists, if not download and save
  if (!file.exists(filename)) {
    download.file(url, destfile = filename, mode = "wb")
  }
}

In this example, we create a subfolder pdfs and then loop over the rows of head_data to download each PDF. The filename variable holds the ID-based path for each file, so rerunning the loop skips any PDF that has already been saved.
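
A single failed download would stop this loop with an error. One possible refinement, sketched here under stated assumptions, wraps each download in tryCatch() and records which IDs succeeded. The helper name download_by_id and its fetch argument are hypothetical; fetch defaults to download.file() and exists only so the function can be tried without network access.

```r
# Hypothetical helper: save each URL as <out_dir>/<id>.pdf, skipping existing files.
# `fetch` defaults to download.file(); it is a parameter only so it can be swapped out.
download_by_id <- function(urls, ids, out_dir = "pdfs",
                           fetch = utils::download.file) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  filenames <- file.path(out_dir, paste0(ids, ".pdf"))
  ok <- logical(length(urls))

  for (i in seq_along(urls)) {
    if (file.exists(filenames[i])) {
      ok[i] <- TRUE                      # already saved on an earlier run
      next
    }
    ok[i] <- tryCatch({
      status <- fetch(urls[i], destfile = filenames[i], mode = "wb")
      isTRUE(status == 0)                # download.file() returns 0 on success
    }, error = function(e) {
      message("Failed for id ", ids[i], ": ", conditionMessage(e))
      unlink(filenames[i])               # remove any partial file
      FALSE
    })
  }

  # Named logical vector: TRUE where the file is now available
  stats::setNames(ok, ids)
}
```

Calling download_by_id(head_data$url_pdf, head_data$id) would return a named logical vector, so the IDs that still need attention are simply the names of its FALSE entries.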

Conclusion

Naming downloaded PDFs after their IDs takes only a few base R functions: paste0() builds the file name, file.exists() skips files that are already saved, and download.file() fetches the rest. With this pattern, every saved file can be traced back to its row in the dataset, and an interrupted run can be restarted without downloading everything again.


Last modified on 2025-04-22