Web Scraping with Beautiful Soup

The Python module Beautiful Soup can be used to scrape HTML code.

First, import Beautiful Soup.

from bs4 import BeautifulSoup

Open the HTML file the same way you would open any other file.

I set encoding="UTF8" because running the code without it raised the error below, and a Google search suggested specifying the encoding explicitly.

UnicodeDecodeError: 'cp949' codec can't decode byte 0xe2 in position 280: illegal multibyte sequence

with open("website.html", encoding="UTF8") as file:
    contents = file.read()
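
The error appears because open() falls back to the platform's default encoding when none is given (cp949 here, judging by the message), while the file is presumably saved as UTF-8. A minimal sketch of the same failure, using an assumed sample string rather than the real website.html:

```python
# Assumed sample text (not read from website.html). The curly apostrophe
# U+2019 is encoded as the bytes e2 80 99 in UTF-8.
raw = "Angela\u2019s Personal Site".encode("utf-8")

try:
    # Decoding UTF-8 bytes as cp949 fails at the 0xe2 byte.
    raw.decode("cp949")
    cp949_failed = False
except UnicodeDecodeError as err:
    cp949_failed = True
    print(err)

# Decoding with the encoding the bytes were written in works.
decoded = raw.decode("utf-8")
print(decoded)
```

Passing encoding="UTF8" to open() has the same effect as the last decode call: the bytes are read with the encoding they were written in.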

Call BeautifulSoup(), passing the markup to parse (contents) and "html.parser" as the parser.

soup = BeautifulSoup(contents, "html.parser")

You can get the title tag, or just its text with .string:

print(soup.title)
# <title>Angela's Personal Site</title>
print(soup.title.string)
# Angela's Personal Site

The prettify() method returns the entire HTML document as an indented string.

print(soup.prettify())
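
To see what prettify() does to the layout, here is a small sketch on an assumed one-line document (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup, not the contents of website.html
html = "<html><head><title>Sample Page</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str with each tag on its own line, indented by depth
pretty = soup.prettify()
print(pretty)
```

The soup object itself is not changed; prettify() only builds a new string.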

Appending .tagname to the soup object (for example soup.a or soup.p) returns the first tag with that name.
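
A minimal sketch of this attribute-style access, using assumed sample markup instead of website.html:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the contents of website.html
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Heading</h1>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <a href="https://example.com/one">One</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

first_paragraph = soup.p       # only the first <p>, even though there are two
heading_text = soup.h1.string  # text inside the first <h1>
missing = soup.table           # no <table> in the markup, so this is None

print(first_paragraph)
print(heading_text)
print(missing)
```

Because a missing tag comes back as None, chaining something like soup.table.string on a page without a table raises AttributeError, so it is worth checking for None first.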

The find_all() method returns every tag in the HTML that matches the given name, here all anchor (a) tags.

all_anchor_tag = soup.find_all(name="a")
print(all_anchor_tag)

[<a href="https://www.appbrewery.co/">The App Brewery</a>, <a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>, <a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>]
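
The result of find_all() behaves like a list, so it can be counted and indexed. A short sketch with assumed sample markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the contents of website.html
html = """
<p>intro</p>
<a href="https://example.com/one">One</a>
<a href="https://example.com/two">Two</a>
"""
soup = BeautifulSoup(html, "html.parser")

anchors = soup.find_all(name="a")
count = len(anchors)   # number of matching tags
first = anchors[0]     # individual tags are reachable by index

# When nothing matches, the result is empty rather than None
no_match = soup.find_all(name="table")

print(count)
print(first)
print(len(no_match))
```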

With a for loop and getText(), you can also get just the text inside each anchor tag.

all_anchor_tag = soup.find_all(name="a")

for tag in all_anchor_tag:
    print(tag.getText())

The App Brewery
My Hobbies
Contact Me

With get("href"), you can get just the link of each anchor tag.

all_anchor_tag = soup.find_all(name="a")

for tag in all_anchor_tag:
    print(tag.get("href"))

https://www.appbrewery.co/
https://angelabauer.github.io/cv/hobbies.html
https://angelabauer.github.io/cv/contact-me.html
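
The two loops above can be combined into one pass that collects each link's text and address together. This is only a sketch on assumed sample markup; get_text() is the same method as getText() under its snake_case name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the contents of website.html
html = """
<a href="https://example.com/one">One</a>
<a href="https://example.com/two">Two</a>
<a>No link</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Build (text, href) pairs; get("href") gives None when the attribute is absent
links = [(tag.get_text(), tag.get("href")) for tag in soup.find_all(name="a")]

for text, href in links:
    print(text, href)
```

Using get("href") rather than tag["href"] keeps the loop from failing on an anchor that has no href attribute.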