[python] Web Crawler [Regular Expressions] #ImageDownload

[Building a Web Crawler]

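import os
import re
import urllib.request

import requests

# [create the output folder up front; urlretrieve fails if ./img is missing]
os.makedirs('./img', exist_ok=True)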

# [img tags: fetch the page and collect every <img ...> tag]
URL = 'https://www.daum.net/'
headers = {'Content-Type': 'application/json; charset=utf-8'}
res = requests.get(URL, headers=headers)
html = res.text
re_img = re.compile(r"<[Ii][Mm][Gg]\s+[^>]+>", re.MULTILINE)

img_tag = re_img.findall(html)

# [src attributes; note that this pattern also matches the "src" inside "data-src"]
re_src = re.compile(r"[Ss][Rr][Cc][^\s]+", re.MULTILINE)

img_src = re_src.findall('\n'.join(img_tag))

# [data-src attributes]
re_data_src = re.compile(r"[Dd][Aa][Tt][Aa][-][Ss][Rr][Cc][^\s]+", re.MULTILINE)

img_data_src = re_data_src.findall('\n'.join(img_tag))

# [src URLs: everything from the opening quote to the next whitespace]
re_url_src = re.compile(r'["][^\s]+', re.MULTILINE)

img_url_src = re_url_src.findall('\n'.join(img_src))

# [data-src URLs: same pattern, applied to the data-src attributes]
re_url_data_src = re.compile(r'["][^\s]+', re.MULTILINE)

img_url_data_src = re_url_data_src.findall('\n'.join(img_data_src))

# [strip the " characters]
img_url_src = '\n'.join(img_url_src).replace("\"", "")
img_url_data_src = '\n'.join(img_url_data_src).replace("\"", "")

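# [open the file in write mode]
f = open('text.txt', 'w+')
f.write(img_url_src)
# [separate the two lists so the last src URL and the first data-src URL do not share a line]
f.write('\n')
f.write(img_url_data_src)
f.close()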

# [open the file in read mode]
f = open('text.txt', 'rt')

i = 0

# [read the file one line at a time]
for url in f:

    # [drop the trailing newline and skip blank lines]
    url = url.strip()
    if not url:
        continue

    a = url.find("http")

    # [no "http" in the URL: assume a protocol-relative address such as //host/path]
    if a == -1:
        i = i + 1
        img_name = str(i) + ".jpg"
        c = "http:" + url
        print(img_name)
        urllib.request.urlretrieve(c, "./img/" + img_name)
    # [the URL contains "http": use it from that position onward]
    else:
        i = i + 1
        img_name = str(i) + ".jpg"
        print(img_name)
        urllib.request.urlretrieve(url[a:], "./img/" + img_name)

f.close()
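
[A possible variant: capture groups and urljoin]

The script above finds the attribute, then the quoted value, then strips the quotes, and prefixes "http:" by hand. The sketch below shows one way those steps might be folded into a single pass. It is not the original method: `extract_image_urls`, `IMG_TAG_RE`, `SRC_ATTR_RE`, and the sample HTML are hypothetical names and data made up for illustration. It assumes attribute values are always quoted, and it uses `urllib.parse.urljoin` to resolve protocol-relative and relative addresses against the page URL.

```python
import re
from urllib.parse import urljoin

# Hypothetical patterns for this sketch; re.IGNORECASE replaces the [Ii][Mm][Gg] style.
IMG_TAG_RE = re.compile(r"<img\s+[^>]+>", re.IGNORECASE)
# The lookbehind keeps "src" from matching inside "data-src";
# group 2 captures the value between the quotes.
SRC_ATTR_RE = re.compile(
    r"""(?<![\w-])(data-src|src)\s*=\s*["']([^"']+)["']""", re.IGNORECASE
)


def extract_image_urls(html, base_url):
    """Return absolute image URLs from src and data-src, in order, without duplicates."""
    urls = []
    for tag in IMG_TAG_RE.findall(html):
        for _attr_name, value in SRC_ATTR_RE.findall(tag):
            # urljoin handles "//host/path", "/path", and already absolute URLs.
            absolute = urljoin(base_url, value)
            if absolute not in urls:
                urls.append(absolute)
    return urls


# Made-up sample markup, so the sketch runs without network access.
sample_html = (
    '<img src="//example.com/a.png" alt="a">'
    '<IMG class="lazy" data-src="https://example.com/b.jpg">'
    '<img src="/img/c.gif">'
)
image_urls = extract_image_urls(sample_html, "https://www.daum.net/")
for image_url in image_urls:
    print(image_url)
```

Because each URL comes back absolute and free of quotes, the list could be handed straight to the download loop without the intermediate text.txt file, though that is a design choice rather than something the original post does.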

낭람

[pyhon] 웹 크롤러[정규표현식] #이미지다운

'Language_ > python' 카테고리의 다른 글

댓글

티스토리툴바

[pyhon] 웹 크롤러[정규표현식] #이미지다운

'Language_ > python' 카테고리의 다른 글

관련글

댓글

티스토리툴바