
How to get div elements from an HTML page in Ruby

2023-08-14 06:30PM

References: https://nokogiri.org/tutorials/searching_a_xml_html_document.html#basic-searching

https://stackoverflow.com/questions/4232345/get-div-nested-in-div-element-using-nokogiri

Using Nokogiri

1. Installation:

With Ruby >= 2.7, run:

$ gem install nokogiri

Or add this line to your Gemfile:

gem 'nokogiri'

# then run bundle install

After that, you can search a document with either XPath or CSS selectors.

XPath:

@doc = Nokogiri::XML(File.open("shows.xml"))

@doc.xpath("//character")
# => ["<character>Al Bundy</character>",
#    "<character>Bud Bundy</character>",
#    "<character>Marcy Darcy</character>",
#    "<character>Larry Appleton</character>",
#    "<character>Balki Bartokomous</character>",
#    "<character>John \"Hannibal\" Smith</character>",
#    "<character>Templeton \"Face\" Peck</character>",
#    "<character>\"B.A.\" Baracus</character>",
#    "<character>\"Howling Mad\" Murdock</character>"]

CSS:

@doc = Nokogiri::XML(File.open("shows.xml"))

@doc.css("dramas character")
# => ["<character>John \"Hannibal\" Smith</character>",
#    "<character>Templeton \"Face\" Peck</character>",
#    "<character>\"B.A.\" Baracus</character>",
#    "<character>\"Howling Mad\" Murdock</character>"]

Getting only the content of a node:

doc = Nokogiri::Slop <<-EOXML
<employees>
<employee status="active">
<fullname>Dean Martin</fullname>
</employee>
<employee status="inactive">
<fullname>Jerry Lewis</fullname>
</employee>
</employees>
EOXML

# navigate!
doc.employees.employee.last.fullname.content # => "Jerry Lewis"

# access node attributes!
doc.employees.employee.first["status"] # => "active"

# use some xpath!
doc.employees.employee("[@status='active']").fullname.content # => "Dean Martin"
doc.employees.employee(:xpath => "@status='active'").fullname.content # => "Dean Martin"

# use some css!
doc.employees.employee("[status='active']").fullname.content # => "Dean Martin"
doc.employees.employee(:css => "[status='active']").fullname.content # => "Dean Martin"

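Example:

The following fetches the response body from a given URL and parses it into an HTML document with Nokogiri. It then selects the elements whose class attribute is "time" and prints the first match.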

require 'httparty'
require 'nokogiri'

# Define a method named get_post_body that takes one argument, url
def get_post_body(url)
  # Send a GET request to the given url with HTTParty.get and store the response in request
  request = HTTParty.get(url)

  # Parse request.body (the response body) into a Nokogiri HTML document and store it in parsed_body
  parsed_body = Nokogiri::HTML(request.body)

  # parsed_body.css('.time') selects every element whose class is "time"; [0] takes the first match, which is then printed
  puts parsed_body.css('.time')[0]
end

Example: getting only the content of the matched .time element

# Inside get_post_body, select the elements whose class is "time" and print the content (the text) of the first match.

puts parsed_body.css('.time')[0].content
