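Introduction to HTML Parsing in Python with Beautiful Soup

1. Installing the Beautiful Soup Library

Beautiful Soup, affectionately nicknamed "delicious soup," is a third-party Python library for parsing HTML and XML data and extracting the information you need. The official site is https://www.crummy.com/software/BeautifulSoup/.

Install it with pip by running the following command.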

pip install beautifulsoup4

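If you see "WARNING: You are using pip version 20.2.3; however, version 20.2.4 is available." during installation, there is no need to worry. The message only suggests upgrading pip, which is optional. You can confirm that the installation succeeded by running "pip list" and checking that beautifulsoup4 appears in the output.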
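
As an alternative to "pip list", the installation can also be checked from Python itself. The following is a minimal sketch, assuming bs4 is installed in the active environment; the inline markup is a made-up smoke test and not part of the demo page.

```python
# Confirm that the bs4 package imports and report its version.
import bs4
from bs4 import BeautifulSoup

version = bs4.__version__  # version string of the installed package

# Tiny smoke test: parse a one-tag document with the built-in parser.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(version, soup.p.string)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.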

After installation, we will use the page at https://python123.io/ws/demo.html for testing. Its source code is shown below.

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

Try the following code in IDLE.

>>> import requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

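2. Basic Elements of Beautiful Soup

2.1 Understanding Beautiful Soup

Beautiful Soup is a library for parsing HTML and XML files; it is a tool for parsing, traversing, and manipulating the "tag tree." When you open an HTML file, you see tags written in angle brackets that nest into a hierarchy. That hierarchy is the tag tree.

Consider the structure of a single tag. For a p tag, "p" is the tag name, and it appears as a matched pair: "<p>...</p>" is the tag pair, or simply the tag. The part class="title" is the attribute region, which holds zero or more attributes. Attributes describe the characteristics of a tag. In this example there is one attribute, with the name "class" and the value "title". Each attribute is a key-value pair.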
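
The anatomy described above can be inspected directly. This is a small sketch that parses the first p tag of the demo page from an inline string (so it runs without network access) and reads back its name and attributes.

```python
from bs4 import BeautifulSoup

# The first <p> tag of the demo page, pasted inline.
html = '<p class="title"><b>The demo python introduces several python courses.</b></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
tag_name = p.name      # the tag name: "p"
attributes = p.attrs   # dict of attributes; "class" is multi-valued, so its value is a list

print(tag_name)    # p
print(attributes)  # {'class': ['title']}
```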

2.2 Importing Beautiful Soup

The Beautiful Soup library is also referred to as beautifulsoup4 or bs4. The usual way to import it is:

from bs4 import BeautifulSoup

This imports the BeautifulSoup class from the bs4 library. You can also import the whole library:

import bs4

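2.3 The BeautifulSoup Class

Beautiful Soup parses an HTML or XML document and maps its tag tree to a BeautifulSoup object. If you view the tag tree as a string, the BeautifulSoup class is the type that represents it. In other words, the HTML/XML document, the tag tree, and the BeautifulSoup object can be treated as equivalent.

2.4 Beautiful Soup Parsers

Parser	Usage	Requirement
Python's built-in HTML parser	BeautifulSoup(mk, 'html.parser')	bs4 installed
lxml HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib

html.parser is usually sufficient, but the other parsers are available when you need to process XML or want faster parsing.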
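
Because lxml and html5lib are optional installs, a script may want to fall back to html.parser when the preferred parser is missing. The sketch below is one way to do that; make_soup is a hypothetical helper name introduced here for illustration, and it relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper, not part of bs4.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in one.
        return BeautifulSoup(markup, "html.parser")

soup = make_soup("<p class='course'>Python</p>")
print(soup.p.string)
```

Note that different parsers may build slightly different trees for the same markup (lxml, for example, wraps fragments in html and body tags), so it is safer to navigate by tag name than by position.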

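2.5 Basic Elements of the BeautifulSoup Class

Element	Description
Tag	A tag: the basic unit, delimited by <> and </>
Name	The tag's name; accessed with <Tag>.name
Attributes	The tag's attributes, as a dictionary; accessed with <Tag>.attrs
NavigableString	The non-attribute string inside a tag; accessed with <Tag>.string
Comment	The comment part of a string inside a tag; a special Comment type

The following example uses the demo page.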

>>> import requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
>>> soup.a.string
'Basic Python'
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
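
Besides going through .attrs, attributes can be read from the tag itself. The sketch below uses the first a tag of the demo page, pasted inline, and shows how to read an attribute that may be absent without raising KeyError. The "title" attribute is simply an example of one that this tag does not have.

```python
from bs4 import BeautifulSoup

html = ('<a href="http://www.icourse163.org/course/BIT-268001" '
        'class="py1" id="link1">Basic Python</a>')
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

href = tag["href"]             # dictionary-style access; raises KeyError if absent
missing = tag.get("title")     # .get() returns None when the attribute is absent
has_id = tag.has_attr("id")    # True if the attribute exists
classes = tag.get("class", []) # multi-valued attribute, returned as a list

print(href, missing, has_id, classes)
```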

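Here is an example of the Comment element. Note that .string returns the comment text without the <!-- --> markers, so the type is the only way to tell a comment from ordinary text.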

>>> from bs4 import BeautifulSoup
>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
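
Since a comment and ordinary text look the same once printed, code that extracts text usually needs a type check. The following sketch reuses the markup above; visible_text is a hypothetical helper name, not a bs4 function.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(markup, "html.parser")

def visible_text(tag):
    """Return the tag's string, or None if it is missing or a comment.

    visible_text is a hypothetical helper for illustration.
    """
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

print(visible_text(soup.b))  # None
print(visible_text(soup.p))  # This is not a comment
```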

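3. Traversing HTML Content with bs4

HTML has a tree structure, which can be traversed in three ways: downward (from the root toward the leaves), upward (from a leaf toward the root), and sideways (between siblings).

3.1 Downward Traversal

Attribute	Description
.contents	List of the tag's child nodes
.children	Iterator over the tag's child nodes
.descendants	Iterator over all descendant nodes

Example of .contents: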

>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title">...</p>, '\n', <p class="course">...</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

Examples of .children and .descendants:

>>> for child in soup.body.children:
    print("child:%s" %(child))
child:

child:<p class="title">...</p>
child:

child:<p class="course">...</p>
child:

>>> for child in soup.body.descendants:
    print("child:%s" %(child))
child:

child:<p class="title">...</p>
child:<b>...</b>
child:The demo python introduces several python courses.
...  # many more

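Note: newline characters between tags are also treated as nodes (NavigableString objects), which is why body has five children rather than two.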
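
When only the tags matter, the whitespace nodes can be filtered out with a type check. This is a small sketch on a shortened inline copy of the demo page's body; the shortened text is an assumption made to keep the example self-contained.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the demo page body, with the same newline layout.
html = """<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.</p>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# All direct children, including the '\n' strings between tags.
all_children = list(soup.body.children)

# Keep only Tag nodes among the direct children.
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]

# Names of every tag below <body>, in document order.
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(all_children))   # 5
print(len(tag_children))   # 2
print(descendant_names)    # ['p', 'b', 'p']
```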

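3.2 Upward Traversal

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over all ancestor tags

Example (the last ancestor is the BeautifulSoup object itself, whose name is [document]):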

>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html>...</html>
>>> for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
p
body
html
[document]
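
The same walk can be written more compactly as a list comprehension. This sketch uses a reduced inline document with the same nesting as the demo page.

```python
from bs4 import BeautifulSoup

# Reduced document with the same nesting as the demo page.
html = '<html><body><p class="course"><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Collect the name of every ancestor, from nearest to farthest.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# The BeautifulSoup object is the top of the tree and has no parent.
top_parent = soup.parent
print(top_parent)      # None
```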

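3.3 Sideways Traversal

Attribute	Description
.next_sibling	The next sibling node in document order
.previous_sibling	The previous sibling node in document order
.next_siblings	Iterator over all following sibling nodes
.previous_siblings	Iterator over all preceding sibling nodes

Condition: sibling traversal only happens among nodes that share the same parent. NavigableString objects also count as nodes, so the next sibling of a tag is not necessarily another tag.

Examples of .next_sibling and .previous_sibling: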

>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language...'
>>> soup.a.previous_sibling.previous_sibling
>>> soup.a.parent
<p class="course">...</p>

Examples of .next_siblings and .previous_siblings:

>>> for sibling in soup.a.next_siblings:
    print("sibling:%s" %(sibling))
sibling: and 
sibling:<a class="py2" ...>Advanced Python</a>
sibling:.
>>> for sibling in soup.a.previous_siblings:
    print("sibling:%s" %(sibling))
sibling:Python is a wonderful general-purpose programming language...
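
Because strings sit between the tags, reaching the next tag often means skipping text nodes. The sketch below shows two ways to do that, a type filter and bs4's find_next_sibling method, on a shortened inline version of the course paragraph (the shortened text is an assumption made for brevity).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<p class="course">Courses: <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Filter the sibling iterator down to Tag nodes only.
later_tags = [s for s in first.next_siblings if isinstance(s, Tag)]

# Or ask directly for the next sibling that is an <a> tag.
next_link = first.find_next_sibling("a")

# Preceding siblings are yielded nearest-first; here there is only one string.
earlier = [str(s) for s in first.previous_siblings]

print(later_tags[0]["id"], next_link["id"], earlier)
```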

4. Formatted HTML Output

The prettify() method in bs4 reformats HTML for display, placing each tag and each string on its own line with indentation.

>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  ...
 </body>
</html>

It can also be applied to a single tag.

>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

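A note on encoding: bs4 converts incoming documents to Unicode and writes output as UTF-8 by default. This works without extra steps on Python 3.x, but Python 2.x may require explicit conversion.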
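
The conversion can be seen by feeding bs4 bytes in a non-UTF-8 encoding. This is a minimal sketch, assuming the input is known to be Latin-1 and telling bs4 so through the from_encoding argument; the "café" markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Bytes in Latin-1, where "é" is the single byte 0xE9.
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells bs4 which encoding to try first when decoding bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = str(soup.p.string)             # decoded to a Unicode string
utf8_bytes = soup.p.encode("utf-8")   # serialized back out as UTF-8 bytes

print(text)
print(utf8_bytes)
```

When the markup is already a str, as with r.text from requests, no decoding step is needed and from_encoding does not apply.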

Tags: BeautifulSoup Python HTML parsing web scraping bs4

Posted June 1, 17:37

異端開発室