I want to download all available PDF files from this website, "https://register.awmf.org/de/start", and organize them into a folder. The code below does not find any PDF link, although the page does link to PDF files, just not directly (a quick check of what rvest actually sees in the raw HTML is included after the code).
When I run the code, this message appears: No PDF links found on the page.
This is the code:
library(rvest)
library(stringr)
library(downloader) # For downloading the actual files
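# (downloader::download() is a thin wrapper around utils::download.file(), so
#  base R's download.file(url, dest, mode = "wb") would also work here)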
# 1. Define the main URL of the page you want to scrape
main_page_url <- "https://register.awmf.org/de/start" # <--- REPLACE with your target page URL
# 2. Read the HTML content of the main page
cat("Reading HTML from:", main_page_url, "\n")
webpage <- tryCatch({
  read_html(main_page_url)
}, error = function(e) {
  cat("Error reading main page:", e$message, "\n")
  return(NULL)
})

if (is.null(webpage)) {
  stop("Could not access the main webpage. Exiting.")
}
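# Note: read_html() only sees the HTML the server sends back; any links a
# browser would add later via JavaScript are not part of 'webpage'.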
# 3. Extract all links (href attributes of <a> tags)
# You might need to inspect the webpage's HTML to find specific CSS selectors
# if you only want links from a certain section (e.g., a div with class "downloads")
cat("Extracting links...\n")
all_links <- webpage %>%
  html_elements("a") %>%   # Select all <a> (anchor) tags
  html_attr("href")        # Extract the 'href' attribute from each <a> tag
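# To restrict extraction to one section, narrow the selector, e.g. (class name
# here is only an illustration): html_elements("div.downloads a")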
# Remove any NULL or NA links
all_links <- all_links[!is.na(all_links)]
# 4. Filter for specific file types (e.g., PDFs)
# You can extend this regex to include .docx, .xlsx, .zip, etc.
pdf_links <- all_links[str_detect(all_links, "\\.pdf$")] # Links ending with .pdf
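# Note: this pattern only matches URLs that literally end in ".pdf"; download
# links carrying a query string (e.g. ".../document.pdf?download=1") would be
# skipped. Dropping the "$" anchor would catch those as well.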
# 5. Construct full URLs (handle relative links)
# This is crucial! Many links are relative (e.g., /documents/file.pdf)
base_url <- "https://register.awmf.org" # base URL of the target website
full_pdf_urls <- sapply(pdf_links, function(link) {
  if (str_starts(link, "http://") || str_starts(link, "https://") || str_starts(link, "ftp://")) {
    return(link) # It's already an absolute URL
  } else if (str_starts(link, "/")) {
    return(paste0(base_url, link)) # It's a root-relative URL
  } else {
    # This case might be more complex, e.g., current_path/file.pdf
    # For simplicity, we'll assume most downloadable files are root-relative or absolute.
    # You might need more sophisticated URL parsing if this is an issue.
    return(NA) # Mark as NA if it's not directly handleable
  }
})
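# Alternative to the manual resolution above: xml2::url_absolute() (xml2 is
# installed together with rvest) resolves absolute, root-relative and
# path-relative links in one call:
# full_pdf_urls <- xml2::url_absolute(pdf_links, main_page_url)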
# Remove any NA URLs that couldn't be resolved
full_pdf_urls <- full_pdf_urls[!is.na(full_pdf_urls)]
if (length(full_pdf_urls) == 0) {
  cat("No PDF links found on the page.\n")
} else {
  cat("Found", length(full_pdf_urls), "PDF links.\n")

  # 6. Define download directory
  download_dir <- "downloaded_pdfs"
  if (!dir.exists(download_dir)) {
    dir.create(download_dir)
  }

  # 7. Download each PDF file
  for (url in full_pdf_urls) {
    # Extract filename from URL (e.g., "file.pdf" from "http://example.com/path/file.pdf")
    filename <- basename(url)
    destination_path <- file.path(download_dir, filename)

    cat("Downloading:", url, "to", destination_path, "\n")

    tryCatch({
      download(url, destination_path, mode = "wb")
      cat(" Successfully downloaded:", filename, "\n")
    }, error = function(e) {
      cat(" Error downloading", filename, ":", e$message, "\n")
    })
  }

  cat("\nAll PDF download attempts completed.\n")
}
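For reference, this is the quick check I would run to see what read_html() actually receives (the names check_page and raw_links are just for this test, not part of the code above). I suspect the register front page is built by JavaScript in the browser, so the plain HTML may simply not contain the PDF links:

library(rvest)

# Fetch the page again and inspect the links present in the raw server response
check_page <- read_html("https://register.awmf.org/de/start")
raw_links <- check_page %>%
  html_elements("a") %>%
  html_attr("href")

cat("Number of <a> tags in the raw HTML:", length(raw_links), "\n")
print(head(raw_links, 20))

# If this shows few or no links, the guideline list (and the PDF links behind it)
# is loaded by JavaScript after the page opens, and a plain HTML scraper cannot
# see it. In that case a headless browser (e.g. the chromote or RSelenium
# packages) or the site's own search/detail pages would be needed instead of
# read_html() on the start page.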