Parsing Web Pages with R: A Comprehensive Guide to Improving Code Quality and Efficiency
=========================================================================================

In this article, we will delve into the world of web scraping and explore how to parse webpage data using R. We’ll take a closer look at the provided code snippet, which attempts to extract information from the Coches.net website. Our goal is to understand the technical details behind the code, identify potential issues, and provide a comprehensive guide for readers who want to tackle similar projects.

Section 1: Introduction to Web Scraping


Web scraping involves extracting data from websites using automated tools. It’s a crucial skill in today’s digital landscape, with applications in data journalism, market research, and more. In this article, we’ll focus on using R as our programming language of choice for web scraping.

Section 2: Understanding the Provided Code


The provided code snippet is written in R and uses the rvest package for HTML parsing. It’s a relatively involved script that performs the following tasks:

  1. Loads the necessary packages, including tidyverse, rvest, and httr.
  2. Defines a vector of user-agent strings to mimic different browsers (a sketch of this pattern follows the list).
  3. Creates a data frame to store the extracted information.
  4. Loops through multiple result pages on Coches.net, extracting data from each page.
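
The original script is not reproduced in full here, but based on the version shown in Section 6, the user-agent rotation presumably looked something like this sketch (the desktop_agents vector, the second agent string, and the header name are assumptions):

library(httr)

# Hypothetical pool of browser signatures, as described in step 2
desktop_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0"
)

url <- "https://www.coches.net/segunda-mano/?pg=1"

# A different agent is drawn at random for every request
x <- GET(url, add_headers("user-agent" = sample(desktop_agents, 1)))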

Section 3: Identifying Potential Issues


The code snippet contains several potential issues that need attention:

  1. Randomized user-agent strings: The script picks a user-agent string at random for each request. This can produce inconsistent responses (sites sometimes serve different markup to different browsers), and the abrupt switching between agents can itself trigger anti-scraping measures on the target website.
  2. Lack of error handling: The script doesn’t include any error handling, so a single failed request or missing HTML node can crash the whole run or produce unexpected results.
  3. Data type inconsistencies: Some fields (e.g., km) are not properly converted to numeric types, which leads to errors or NA values when performing calculations or comparisons, as illustrated below.
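
The kilometre field makes the third issue concrete. The raw text on a card combines a thousands separator with a unit suffix, so a direct conversion silently yields NA (the sample value below is hypothetical):

library(stringr)

km_raw <- "84.000 km"   # hypothetical value as scraped from a card

as.numeric(km_raw)      # returns NA with a coercion warning

# Strip the separator and the unit first, then convert
as.numeric(str_replace_all(km_raw, "\\.| km", ""))   # returns 84000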

Section 4: Improving Code Quality and Efficiency


To improve the code quality and efficiency, we can make the following adjustments:

  1. Use a consistent user-agent string: Instead of randomly selecting user-agent strings, use a single string that mimics a popular browser (e.g., Google Chrome).
  2. Implement error handling: Wrap the network requests in tryCatch() so a failed page is skipped rather than aborting the run (see the sketch after this list).
  3. Convert data types consistently: Strip currency symbols, unit suffixes, and thousands separators, then convert each field to the appropriate numeric or character type.
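
For the second point, here is a minimal sketch of request-level error handling, assuming pages are fetched with httr::GET() as in Section 6; a failed request returns NULL so the caller can skip that page instead of crashing:

library(httr)

safe_get <- function(url, agent) {
  tryCatch(
    GET(url, add_headers("user-agent" = agent), timeout(10)),
    error = function(e) {
      message("Request failed for ", url, ": ", conditionMessage(e))
      NULL   # signal the caller to skip this page
    }
  )
}

x <- safe_get("https://www.coches.net/segunda-mano/?pg=1",
              "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
if (!is.null(x)) {
  # parse the response with read_html(x) as usual
}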

Section 5: Additional Considerations


Before proceeding with the script, consider the following:

  1. Check the website’s terms and conditions and its robots.txt file: Make sure you are permitted to scrape data from Coches.net.
  2. Consider dropping down a layer in the scraping stack: rvest is itself a convenience wrapper around xml2, so for complex or high-volume extractions, calling xml2 directly (for parsing) together with httr (for requests) can offer more control and better performance (see the sketch after this list).
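
As a rough illustration of the second point, this sketch parses the same listing page with xml2 directly; the .mt-Card-body class is taken from the script in Section 6, and the assumption that the cards are div elements is ours:

library(xml2)

doc <- read_html("https://www.coches.net/segunda-mano/?pg=1")

# XPath queries run in compiled code and skip rvest's CSS-to-XPath translation
cards <- xml_find_all(doc, "//div[contains(@class, 'mt-Card-body')]")
length(cards)   # number of ad cards found on the page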

Section 6: Example Code Adjustments


Here’s an updated version of the code snippet incorporating our suggested improvements:

library(tidyverse)
library(rvest)
library(httr)   # provides GET() and add_headers(), used below

desktop_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.37"
)

start <- Sys.time()

# Create a data frame to store extracted information
line <- data.frame(
  Titulo = character(),
  Precio = numeric(),
  Provincia = character(),
  Motor = character(),
  Año = numeric(),
  Kilometros = numeric(),
  Fecha_subida = character(),
  Link = character()
)

for (counter in 1:10) {
  url <- paste0("https://www.coches.net/segunda-mano/?pg=", counter)
  
  # Use a consistent user-agent string; skip the page if the request errors out
  x <- tryCatch(
    GET(url, add_headers('user-agent' = desktop_agents[1])),
    error = function(e) NULL
  )
  if (is.null(x)) next
  
  bloque <- x %>% read_html() %>% html_nodes(".mt-Card-body")
  
  # seq_along() avoids iterating over 1:0 when no cards are found
  for (p in seq_along(bloque)) {
    titulo <- bloque[p] %>% html_nodes(".mt-CardAd-title .mt-CardAd-titleHiglight") %>% html_text()
    if (length(titulo) == 0) next  # skip malformed cards without a title
    
    precio <- bloque[p] %>% html_nodes(".mt-CardAd-price .mt-CardAd-titleHiglight") %>% html_text()
    # Remove the euro sign and the thousands separator before converting
    precio <- as.numeric(str_replace_all(precio, "\\.| €", ""))
    
    info <- bloque[p] %>% html_nodes(".mt-CardAd-attribute") %>% html_text()
    prov <- info[1]
    motor <- info[2]
    año <- as.numeric(info[3])  # convert so the type matches the Año column
    
    # Remove the thousands separator and the unit suffix, then convert
    km <- as.numeric(str_replace_all(info[4], "\\.| km", ""))
    
    fechasubida <- bloque[p] %>% html_nodes(".mt-CardAdDate-time") %>% html_text()
    
    link <- bloque[p] %>% html_nodes(".mt-CardAd-link") %>% html_attr(name = "href")
    link <- paste0("https://www.coches.net", link[1])
    
    print(paste(titulo, precio, prov, motor, año, km, fechasubida, link))
    line <- rbind(
      line,
      data.frame(
        Titulo = titulo,
        Precio = precio,
        Provincia = prov,
        Motor = motor,
        Año = año,
        Kilometros = km,
        Fecha_subida = fechasubida,
        Link = link
      )
    )
  }
}

end <- Sys.time()
elapsed <- end - start
print(paste("Cochisto downloaded 100% of the listings in", format(elapsed)))

This updated version uses a consistent user-agent string, wraps each request in tryCatch() so a failed page is skipped rather than crashing the run, and converts the price, year, and kilometre fields to numeric types. It’s essential to note that web scraping can be resource-intensive and may impact the target website’s performance, so keep the request volume modest.

Conclusion


In this article, we explored the technical details behind the provided R code snippet, which attempts to extract information from Coches.net. We identified potential issues with the original code, suggested improvements for better quality and efficiency, and provided an updated version incorporating our recommended adjustments. Web scraping is a complex topic that requires attention to detail, and it’s crucial to consider the ethical implications of data extraction.


Last modified on 2023-07-15