Web Scraping with Python in Practice: Four Representative Use Cases

1. Retrieving product information from an e-commerce site

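As an example, we try to fetch the HTML of a specific product page (here, the OnePlus 9R smartphone) on JD.com, one of China's major e-commerce platforms, with a plain HTTP request. The URL is https://item.jd.com/100020542894.html.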

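First, check robots.txt at https://item.jd.com/robots.txt. Its content had the form shown below: the empty Disallow lines mean that no paths are off limits for the listed crawlers such as Googlebot. The file can change at any time, so always check the current version before scraping.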

User-agent: Googlebot
Disallow:
User-agent: AdsBot-Google
Disallow:
User-agent: Googlebot-Image
Disallow:
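
The same check can be done in code. The sketch below feeds the rules shown above to the standard-library urllib.robotparser and asks whether a given crawler may fetch the product URL. The user-agent name "my-scraper" and the helper is_allowed are hypothetical, and the rules are pasted inline rather than downloaded; against the live site you would call set_url() and read() on the parser instead.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the excerpt above; the live file may differ.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:
User-agent: AdsBot-Google
Disallow:
User-agent: Googlebot-Image
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


product_url = "https://item.jd.com/100020542894.html"
googlebot_ok = is_allowed(ROBOTS_TXT, "Googlebot", product_url)
# "my-scraper" is a made-up name that matches none of the groups above.
my_scraper_ok = is_allowed(ROBOTS_TXT, "my-scraper", product_url)
print(googlebot_ok, my_scraper_ok)
```

An empty Disallow permits everything for the named crawler, and a crawler that matches no group is not restricted by this file either. Being permitted by robots.txt does not replace reading the site's terms of use.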

A basic request may need a browser-like User-Agent. Some sites block the default python-requests/2.x value, so we send a browser-style header instead.

import requests

target_url = "https://item.jd.com/100020542894.html"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

try:
    response = requests.get(target_url, headers=headers, timeout=10)
    response.raise_for_status()
    response.encoding = response.apparent_encoding
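    print("Status:", response.status_code)
    # partition() returns empty strings when <title> is missing, so this never raises IndexError
    title = response.text.partition("<title>")[2].partition("</title>")[0]
    print("Title (first 100 chars):", title[:100] or "(not found)")
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")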
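
String splitting is fragile when the tag is written as <TITLE>, carries attributes, or is absent. As a sturdier alternative, here is a minimal sketch that pulls the title out with the standard-library html.parser. TitleParser and extract_title are illustrative names, not part of requests, and the function takes HTML text that has already been downloaded.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> also arrives as "title".
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        text = "".join(self._parts).strip()
        return text or None


def extract_title(html: str) -> Optional[str]:
    """Return the page title, or None when the document has none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


print(extract_title("<html><head><title> JD product page </title></head></html>"))
```

In the request code above, this would be called as extract_title(response.text) after the encoding has been set.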

2. Retrieving keyword search results from search engines

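We query Baidu (百度) and 360 Search (360搜索, So.com) through their ordinary search result pages, building the query string from parameters at request time. These are the regular web search URLs rather than official APIs. Baidu is addressed here over plain HTTP at http://www.baidu.com/s with the keyword in wd, and 360 Search over HTTPS at https://www.so.com/s with the keyword in q.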

Japanese or Chinese keywords are URL-encoded automatically when passed through the params argument, so no manual escaping is needed.

import requests

search_term = "ウルトラマンティガ"

# Baidu search
baidu_params = {"wd": search_term}
baidu_url = "http://www.baidu.com/s"
baidu_resp = requests.get(baidu_url, params=baidu_params, headers={"User-Agent": "Mozilla/5.0"}, timeout=8)

# 360 Search (So.com)
so_params = {"q": search_term}
so_url = "https://www.so.com/s"
so_resp = requests.get(so_url, params=so_params, headers={"User-Agent": "Mozilla/5.0"}, timeout=8)

print(f"Baidu URL: {baidu_resp.url}")
print(f"Baidu response length: {len(baidu_resp.text)} chars")

print(f"360 Search URL: {so_resp.url}")
print(f"360 Search response length: {len(so_resp.text)} chars")
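
To see what that automatic encoding produces without sending any request, the sketch below builds the same kind of URL with the standard-library urllib.parse. build_search_url is a hypothetical helper written for this illustration; requests performs an equivalent percent-encoding step when it receives params.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url: str, params: dict) -> str:
    """Append params to base_url as a percent-encoded (UTF-8) query string."""
    return f"{base_url}?{urlencode(params)}"


search_term = "ウルトラマンティガ"
baidu_search_url = build_search_url("http://www.baidu.com/s", {"wd": search_term})
so_search_url = build_search_url("https://www.so.com/s", {"q": search_term})

# The encoded URLs contain only ASCII characters.
print(baidu_search_url)
print(so_search_url)

# Decoding the query string recovers the original keyword.
decoded_term = parse_qs(urlsplit(baidu_search_url).query)["wd"][0]
print(decoded_term)
```

Each non-ASCII character becomes a run of %XX bytes, which is why the resp.url values printed above look much longer than the keyword itself.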

3. Downloading and saving image files from the web

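We fetch a static image resource (for example, a high-resolution photo from Chinese National Geography) as raw bytes and save it to local disk. The same approach works for images served through a CDN, but some hosts reject requests that lack an expected Referer header (hotlink protection). CORS, by contrast, is enforced by browsers and does not restrict a script that uses requests.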

import requests
import os

image_url = "http://img0.dili360.com/pic/2021/04/28/6088f9a33e0ad1s23538607.jpg"
local_dir = "./downloads"
local_path = os.path.join(local_dir, os.path.basename(image_url))

# Create the download directory if it does not exist
os.makedirs(local_dir, exist_ok=True)

try:
    img_response = requests.get(image_url, timeout=15)
    img_response.raise_for_status()
    
    with open(local_path, "wb") as file:
        file.write(img_response.content)
    
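    print(f"Saved image to {local_path} ({len(img_response.content)} bytes)")
except (requests.exceptions.RequestException, OSError) as err:
    # Network/HTTP failures and file-system failures are the two expected error sources
    print(f"Image download failed: {err}")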
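
One detail worth guarding against: os.path.basename(image_url) keeps any query string or fragment, so a URL ending in photo.jpg?w=800 would yield an awkward file name. The sketch below derives the name from the URL path only. filename_from_url and the default name download.bin are assumptions made for this example, and the example.com URLs are placeholders.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str, default: str = "download.bin") -> str:
    """Return the last path segment of url, ignoring query string and fragment."""
    path = unquote(urlsplit(url).path)
    # URL paths always use "/", so posixpath is used regardless of the local OS.
    name = posixpath.basename(path)
    return name or default


print(filename_from_url("http://img0.dili360.com/pic/2021/04/28/6088f9a33e0ad1s23538607.jpg"))
print(filename_from_url("https://example.com/images/photo.jpg?w=800#top"))
print(filename_from_url("https://example.com/"))
```

The result can be passed to os.path.join(local_dir, ...) in place of os.path.basename(image_url).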

4. Looking up geographic information for an IP address

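We use a public IP information service such as IP138 to look up the location and provider of an arbitrary IP address. One caveat: when the response does not declare its character encoding (for example gb2312), requests may decode the text incorrectly, so we set the encoding from apparent_encoding, which guesses it from the response bytes.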

import requests

ip_query_url = "https://www.ip138.com/iplookup.asp"
query_params = {"ip": "144.144.144.144", "action": "2"}

try:
    ip_resp = requests.get(
        ip_query_url,
        params=query_params,
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
        timeout=10
    )
    ip_resp.raise_for_status()
    ip_resp.encoding = ip_resp.apparent_encoding
    
    # Extract the <title> text (simple string search, not a full HTML parse)
    tag_pos = ip_resp.text.find("<title>")
    title_end = ip_resp.text.find("</title>", tag_pos)
    if tag_pos != -1 and title_end != -1:
        title = ip_resp.text[tag_pos + len("<title>"):title_end].strip()
    else:
        title = "not found"

    print(f"Queried IP: {query_params['ip']}")
    print(f"Page title: {title}")
    print(f"Response size: {len(ip_resp.text)} chars")
    
except requests.exceptions.RequestException as e:
    print(f"IP lookup error: {e}")
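
apparent_encoding is a statistical guess and can be wrong on short pages. When the page declares its charset in a meta tag, reading that declaration is an alternative. The sketch below is a simplified illustration under that assumption: decode_html and its regular expression are hypothetical helpers, they only inspect the first 2048 bytes, and they fall back to UTF-8 when no usable declaration is found. The sample page is made up for the demonstration.

```python
import re

# Matches both <meta charset="..."> and <meta ... content="text/html; charset=...">
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE
)


def decode_html(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw HTML bytes using the charset declared in a meta tag, if any."""
    match = _META_CHARSET.search(raw[:2048])
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The declared name is not a codec Python knows about.
        return raw.decode(fallback, errors="replace")


# A made-up gb2312 page standing in for a real response body.
sample = '<html><head><meta charset="gb2312"><title>IP查询</title></head></html>'.encode("gb2312")

print(decode_html(sample))            # decoded with the declared charset
print(sample.decode("iso-8859-1"))    # wrong codec: the title turns into mojibake
```

With a real response this would be called as decode_html(ip_resp.content), which works on the undecoded bytes rather than ip_resp.text.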

Tags: Python requests web-scraping http-client encoding-handling

Posted June 13, 22:30

異端開発室
