Determine why scraping a website works for only one POST request body string and not others

riffnl, 2 months ago

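I'm looking to scrape a publicly available table from the New York electric grid, found at: http://icap.nyiso.com/ucap/public/auc_view_spot_detail.do

I can do this for summer seasons but not for winter seasons. It's not clear to me what I'm missing, so I'm hoping someone smarter can help me out.

Below is my process, starting with a screenshot of the page.

[Image: Nyiso inspect]

It takes a combination of Season & Month plus hitting Display to generate the table. I copied the request header information, including the URL-encoded payload, which I include as the body of the POST request.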

# libraries
library(jsonlite) 
library(lubridate)
library(data.table)
library(httr)
library(rvest)

# get session and cookies
initial_url <- "http://icap.nyiso.com/ucap/public/auc_view_spot_detail.do"
initial_response <- GET(initial_url)
cookie_data <- cookies(initial_response)
cookie_string <- paste0(cookie_data$name, "=", cookie_data$value, collapse = "; ")


# Define the POST request headers, including cookies
headers <- c(
  "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
  "Accept-Encoding" = "gzip, deflate",
  "Accept-Language" = "en-US,en;q=0.9",
  "Cache-Control" = "max-age=0",
  "Connection" = "keep-alive",
  "Content-Length" = "85",
  "Content-Type" = "application/x-www-form-urlencoded",
  "Cookie" = cookie_string,
  "Host" = "icap.nyiso.com",
  "Origin" = "null",
  "Upgrade-Insecure-Requests" = "1",
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
)

# Define the URL for the POST request
post_url <- "http://icap.nyiso.com/ucap/public/auc_view_spot_detail.do"

# Below is working code for a "Summer" season:
response <- POST(post_url, add_headers(.headers = headers), encode = "form", 
                 body = "seasonId=702793&seasonId=Summer+2024&month=05%2F2024&month=May%2F2024&display=Display")
html_content <- content(response, as = "text")
html <- read_html(html_content)
tables <- html %>% html_nodes("table")
html_table(tables[4]) # print
#[[1]]
## A tibble: 45 × 2
#   X1                           X2            
#   <chr>                        <chr>         
# 1 "05/2024"                    "05/2024"     
# 2 "G-J Locality"               "G-J Locality"
# 3 "Awarded Deficiency (MW)"    "1,888.1"     
# 4 "Awarded Excess (MW)"        "1,694.400"   
# 5 "% Excess Above Requirement" "14.78"       
# 6 "Price ($/kW-M)"             "$4.27"       
# 7 ""                           ""            
# 8 "LI"                         "LI"          
# 9 "Awarded Deficiency (MW)"    "176.9"       
#10 "Awarded Excess (MW)"        "519.200"     
## ℹ 35 more rows
## ℹ Use `print(n = ...)` to see more rows

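Oddly, if I swap in the body for a winter season, exactly as shown when inspecting the network traffic, the process stops working. Any idea what I might be missing?

[Image: nyiso inspect 2]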

# does not work to generate the data
response <- POST(post_url, add_headers(.headers = headers), encode = "form", 
                 body = "seasonId=702409&seasonId=Winter+2023-2024&month=02%2F2024&month=Feb%2F2024&display=Display")
html_content <- content(response, as = "text")
html <- read_html(html_content)
tables <- html %>% html_nodes("table")
html_table(tables[4]) # there is no such table
   

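I've noticed some odd behavior:

  • The body parameters are duplicated, but if I remove either copy of a parameter, the request no longer works.
  • You can change the seasonId number ( seasonId=702793 ) as long as the string ( seasonId=Summer+2024 ) is kept. Other IDs are listed here: http://icap.nyiso.com/ucap/rest/seasons/public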
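
For reference, every body string in this post follows the same pattern: the numeric seasonId, the season label, the month as MM/YYYY, the month as Mon/YYYY, and display=Display. A minimal sketch of a helper that assembles one is below. The name build_body and its arguments are made up for illustration and are not part of anything the site provides; it only reproduces the strings copied from the browser.

```r
# Hypothetical helper (not from the site): assemble the URL-encoded body
# from a season id, a season label and any date inside the target month.
build_body <- function(season_id, season_label, month_date) {
  # URL-encode a value, then use "+" for spaces as the browser does in forms
  enc <- function(x) {
    gsub("%20", "+", utils::URLencode(x, reserved = TRUE), fixed = TRUE)
  }
  month_num  <- format(month_date, "%m/%Y")            # e.g. "02/2024"
  month_name <- paste0(month.abb[as.integer(format(month_date, "%m"))],
                       "/", format(month_date, "%Y"))  # e.g. "Feb/2024"
  paste0("seasonId=", season_id,
         "&seasonId=", enc(season_label),
         "&month=", enc(month_num),
         "&month=", enc(month_name),
         "&display=Display")
}

winter_body <- build_body("702409", "Winter 2023-2024", as.Date("2024-02-01"))
print(winter_body)
```

month.abb is used instead of format(x, "%b") so that the month abbreviation does not depend on the locale.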

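I also couldn't find a dedicated public REST API for the actual data in the table.

Thanks for your time and thoughts.

Here is a bunch of body strings that I used to determine this is a winter-only problem: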

body_strings <- c("seasonId=700085&seasonId=Winter+2021-2022&month=01%2F2022&month=Jan%2F2022&display=Display", 
"seasonId=700085&seasonId=Winter+2021-2022&month=02%2F2022&month=Feb%2F2022&display=Display", 
"seasonId=700085&seasonId=Winter+2021-2022&month=03%2F2022&month=Mar%2F2022&display=Display", 
"seasonId=700085&seasonId=Winter+2021-2022&month=04%2F2022&month=Apr%2F2022&display=Display", 
"seasonId=700490&seasonId=Summer+2022&month=05%2F2022&month=May%2F2022&display=Display", 
"seasonId=700490&seasonId=Summer+2022&month=06%2F2022&month=Jun%2F2022&display=Display", 
"seasonId=700490&seasonId=Summer+2022&month=07%2F2022&month=Jul%2F2022&display=Display", 
"seasonId=700490&seasonId=Summer+2022&month=08%2F2022&month=Aug%2F2022&display=Display", 
"seasonId=700490&seasonId=Summer+2022&month=09%2F2022&month=Sep%2F2022&display=Display", 
"seasonId=700490&seasonId=Summer+2022&month=10%2F2022&month=Oct%2F2022&display=Display", 
"seasonId=700882&seasonId=Winter+2022-2023&month=11%2F2022&month=Nov%2F2022&display=Display", 
"seasonId=700882&seasonId=Winter+2022-2023&month=12%2F2022&month=Dec%2F2022&display=Display", 
"seasonId=700882&seasonId=Winter+2022-2023&month=01%2F2023&month=Jan%2F2023&display=Display", 
"seasonId=700882&seasonId=Winter+2022-2023&month=02%2F2023&month=Feb%2F2023&display=Display", 
"seasonId=700882&seasonId=Winter+2022-2023&month=03%2F2023&month=Mar%2F2023&display=Display", 
"seasonId=700882&seasonId=Winter+2022-2023&month=04%2F2023&month=Apr%2F2023&display=Display", 
"seasonId=701280&seasonId=Summer+2023&month=05%2F2023&month=May%2F2023&display=Display", 
"seasonId=701280&seasonId=Summer+2023&month=06%2F2023&month=Jun%2F2023&display=Display", 
"seasonId=701280&seasonId=Summer+2023&month=07%2F2023&month=Jul%2F2023&display=Display", 
"seasonId=701280&seasonId=Summer+2023&month=08%2F2023&month=Aug%2F2023&display=Display", 
"seasonId=701280&seasonId=Summer+2023&month=09%2F2023&month=Sep%2F2023&display=Display", 
"seasonId=701280&seasonId=Summer+2023&month=10%2F2023&month=Oct%2F2023&display=Display", 
"seasonId=702409&seasonId=Winter+2023-2024&month=11%2F2023&month=Nov%2F2023&display=Display", 
"seasonId=702409&seasonId=Winter+2023-2024&month=12%2F2023&month=Dec%2F2023&display=Display", 
"seasonId=702409&seasonId=Winter+2023-2024&month=01%2F2024&month=Jan%2F2024&display=Display", 
"seasonId=702409&seasonId=Winter+2023-2024&month=02%2F2024&month=Feb%2F2024&display=Display", 
"seasonId=702409&seasonId=Winter+2023-2024&month=03%2F2024&month=Mar%2F2024&display=Display", 
"seasonId=702409&seasonId=Winter+2023-2024&month=04%2F2024&month=Apr%2F2024&display=Display", 
"seasonId=702793&seasonId=Summer+2024&month=05%2F2024&month=May%2F2024&display=Display", 
"seasonId=702793&seasonId=Summer+2024&month=06%2F2024&month=Jun%2F2024&display=Display", 
"seasonId=702793&seasonId=Summer+2024&month=07%2F2024&month=Jul%2F2024&display=Display", 
"seasonId=702793&seasonId=Summer+2024&month=08%2F2024&month=Aug%2F2024&display=Display"
)
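Latest replies
  • Your problem is that you specify a Content-Length in the headers but don't respect that length in the body string ('Winter 2023-2024' is longer than 'Summer 2023').

    Part of the problem here is that you have over-specified the request, which makes it harder to debug. You don't need the initial GET request, the cookies, the user agent, or most of the other headers.

    The following is fully reproducible in a clean session: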

    library(httr)
    library(rvest)
    
    headers <- c(`Connection` = "keep-alive",
                 `Content-Type`  = "application/x-www-form-urlencoded",
                 `Upgrade-Insecure-Requests` = "1")
    
    POST("http://icap.nyiso.com/ucap/public/auc_view_spot_detail.do", 
         body = paste0("seasonId=702409",
                       "&seasonId=Winter+2023-2024",
                       "&month=02%2F2024",
                       "&month=Feb%2F2024",
                       "&display=Display"), 
         add_headers(.headers = headers)) %>%
      content(as = "text") %>%
      read_html() %>% 
      html_nodes("table") %>%
      getElement(4) %>%
      html_table()
    #> # A tibble: 45 x 2
    #>    X1                           X2            
    #>    <chr>                        <chr>         
    #>  1 "02/2024"                    "02/2024"     
    #>  2 "G-J Locality"               "G-J Locality"
    #>  3 "Awarded Deficiency (MW)"    "2,620.8"     
    #>  4 "Awarded Excess (MW)"        "1,748.600"   
    #>  5 "% Excess Above Requirement" "14.16"       
    #>  6 "Price ($/kW-M)"             "$4.56"       
    #>  7 ""                           ""            
    #>  8 "LI"                         "LI"          
    #>  9 "Awarded Deficiency (MW)"    "42.8"        
    #> 10 "Awarded Excess (MW)"        "859.700"     
    #> # i 35 more rows
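
To see the mismatch for yourself, you can count the characters in the two body strings from the question. This is only a quick check in base R, with no request sent; the variable names are arbitrary. It also shows one way to drop the hard-coded header from the original headers vector, on the assumption (which matches my understanding of httr and libcurl, but is not verified here) that the correct Content-Length is filled in automatically when you don't set it.

```r
# The two bodies from the question: the summer one works, the winter one fails
summer <- "seasonId=702793&seasonId=Summer+2024&month=05%2F2024&month=May%2F2024&display=Display"
winter <- "seasonId=702409&seasonId=Winter+2023-2024&month=02%2F2024&month=Feb%2F2024&display=Display"

nchar(summer)  # 85, which matches the hard-coded "Content-Length" = "85"
nchar(winter)  # 90, five characters more than the header claims

# Shortened stand-in for the question's headers vector
headers <- c("Content-Type"   = "application/x-www-form-urlencoded",
             "Content-Length" = "85")

# Drop the hard-coded length so it can no longer disagree with the body
fixed_headers <- headers[names(headers) != "Content-Length"]
print(names(fixed_headers))
```

All of the ASCII body strings in the question have one byte per character, so nchar() gives the same number the server would expect in Content-Length.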
    