Getting a Web Page's Title with Ruby


Install Nokogiri. The open-uri library ships with Ruby, so it needs no separate install:

gem install nokogiri

Create a Ruby script called get_titles.rb and add the following code to load the libraries, open a URL as a file, send its contents to Nokogiri, and extract the value of the <title> tag:
require 'nokogiri'
require 'open-uri'

url = "https://google.com" 
URI.open(url) do |f|
  doc = Nokogiri::HTML(f)
  title = doc.at_css('title').text
  puts title
end

Save the file and run the program:

ruby get_titles.rb

The result shows the page title for Google:

Google

To do this for multiple URLs, put the URLs in an array manually, or get them from a file.

Reading URLs from a File

You may already have the list of URLs in a file, which may have come from a data export. Using Ruby’s File.readlines, you can quickly convert the file into an array.
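As a rough illustration of how File.readlines turns a file into an array, the sketch below reads one URL per line. The helper name read_urls is hypothetical, and it writes a temporary file so the example does not depend on any file already existing. It passes chomp: true, an option File.readlines accepts to strip the line break from each line:

```ruby
require 'tempfile'

# Hypothetical helper: reads one URL per line, dropping line breaks,
# surrounding whitespace, and blank lines.
def read_urls(path)
  lines = File.readlines(path, chomp: true)
  lines.map(&:strip).reject(&:empty?)
end

# Demonstrate with a temporary file so the sketch is self-contained.
Tempfile.create('links') do |file|
  file.write("https://google.com\nhttps://devto\n\n")
  file.flush
  p read_urls(file.path)
end
```

Dropping blank lines is a design choice for exported data, which often ends with one or more empty lines.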

Create a new file called links.txt and add a couple of links. Make one of them a bad URL so you can test the error handling.

https://google.com
https://devto

Save the file.

Now return to your get_titles.rb file and modify the code so it reads the file in line-by-line, and uses each line as a URL:

# get_titles.rb
require 'nokogiri'
require 'open-uri'

lines = File.readlines('links.txt')
lines.each do |line|
  url = line.chomp
  URI.open(url) do |f|
    doc = Nokogiri::HTML(f)
    title = doc.at_css('title').text
    puts title
  end
rescue SocketError
  puts "#{url}: can't connect. Bad URL?"
end

Each line read from the file ends with a line break, which you remove with the .chomp method before storing the value in the url variable. Unlike .chop, which always deletes the last character, .chomp removes only a trailing line break, so a final line with no line break stays intact.

The URI.open method raises a SocketError if it can’t resolve or connect to the host, so you rescue that error and print a helpful message.

Save the file and run the program again:

ruby get_titles.rb