XML parsing in Python

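This article focuses on how to parse a given XML file and extract useful data from it in a structured way.

XML: XML stands for eXtensible Markup Language. It was designed to store and transport data, and to be readable by both humans and machines. That is why its design goals emphasize simplicity, generality, and usability across the Internet. The XML file parsed in this tutorial is actually an RSS feed.

RSS: RSS (Rich Site Summary, often called Really Simple Syndication) uses a family of standard web feed formats to publish frequently updated information such as blog entries, news headlines, audio, and video. RSS is plain text in XML format.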

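The RSS format itself is relatively easy to read, both for automated processes and for humans.
The RSS processed in this tutorial is the feed of top news stories from a popular news website; its URL appears in loadRSS() in the code below. Our goal is to process this RSS feed (or XML file) and save it in another format for later use.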
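To make the structure concrete, here is a minimal sketch of how an RSS 2.0 document maps to a tree. The feed content is invented for illustration (the titles and example.com links are not from the real feed); only the rss > channel > item layout is what the tutorial relies on.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written RSS snippet (not taken from the real feed).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)

# The root is <rss>, its only child is <channel>, and the stories are <item>s.
channel = root.find("channel")
titles = [item.findtext("title") for item in channel.findall("item")]
print(root.tag, channel.findtext("title"), titles)
```

Each item becomes one record later on, which is why the parsing code only needs to locate the item elements and read their children.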

Python module used: This article focuses on Python's built-in xml module for parsing XML, and in particular on its ElementTree XML API.

Implementation:

# Python code to illustrate parsing of XML files
# importing the required modules
import csv
import requests
import xml.etree.ElementTree as ET


def loadRSS():
    # url of rss feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'

    # creating HTTP response object from given url
    resp = requests.get(url)

    # saving the xml file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)


def parseXML(xmlfile):
    # create element tree object
    tree = ET.parse(xmlfile)

    # get root element
    root = tree.getroot()

    # create empty list for news items
    newsitems = []

    # iterate news items
    for item in root.findall('./channel/item'):
        # empty news dictionary
        news = {}

        # iterate child elements of item
        for child in item:
            # special checking for the namespaced media:content tag,
            # which is expanded to {namespace-uri}content when parsed
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            else:
                # keep the text as str so the csv module writes it as-is
                news[child.tag] = child.text

        # append news dictionary to news items list
        newsitems.append(news)

    # return news items list
    return newsitems


def savetoCSV(newsitems, filename):
    # specifying the fields for csv file
    fields = ['guid', 'title', 'pubDate', 'description', 'link', 'media']

    # writing to csv file (newline='' is what the csv module expects)
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        # creating a csv dict writer object
        writer = csv.DictWriter(csvfile, fieldnames=fields)

        # writing headers (field names)
        writer.writeheader()

        # writing data rows
        writer.writerows(newsitems)


def main():
    # load rss from web to update existing xml file
    loadRSS()

    # parse xml file
    newsitems = parseXML('topnewsfeed.xml')

    # store news items in a csv file
    savetoCSV(newsitems, 'topnews.csv')


if __name__ == '__main__':
    # calling main function
    main()

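The above code will:

Load the RSS feed from the specified URL and save it as an XML file.
Parse the XML file and store the news as a list of dictionaries, where each dictionary is a single news item.
Save the news items to a CSV file.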

Let's go through the code piece by piece.

def loadRSS():
    # url of rss feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'

    # creating HTTP response object from given url
    resp = requests.get(url)

    # saving the xml file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)

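Here, an HTTP request is sent to the URL of the RSS feed, and the content of the response, which is the XML data, is saved as topnewsfeed.xml in the local directory.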
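If requests is not installed, the same download step could be sketched with the standard library alone. This is an alternative to the article's loadRSS(), not its method: download_feed and its parameters are hypothetical names, and urllib.request is swapped in for requests.

```python
import urllib.request
from pathlib import Path


def download_feed(url, dest, timeout=10):
    """Fetch url and write the raw bytes to dest; return the byte count."""
    # urlopen raises URLError/HTTPError on failure, so an error response
    # is not silently written to disk.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    Path(dest).write_bytes(data)
    return len(data)
```

A call such as download_feed('http://www.hindustantimes.com/rss/topnews/rssfeed.xml', 'topnewsfeed.xml') would then play the role of loadRSS(). The explicit timeout keeps the script from hanging indefinitely on an unresponsive server.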

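The parseXML() function parses the XML file. XML is an inherently hierarchical data format, and the most natural way to represent it is as a tree. The code uses the xml.etree.ElementTree module (imported as ET), which provides two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in that tree. Interactions with the whole document, such as reading from and writing to files, are usually done at the ElementTree level. Interactions with a single XML element and its sub-elements are done at the Element level.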
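The split between the two classes can be sketched as follows. The element names and text are made up for illustration; the calls themselves (Element, SubElement, ElementTree, write, parse) are part of the standard module.

```python
import io
import xml.etree.ElementTree as ET

# Element level: build and inspect individual nodes.
root = ET.Element("rss", {"version": "2.0"})
channel = ET.SubElement(root, "channel")
ET.SubElement(channel, "title").text = "Example News"

# ElementTree level: wrap the root to read or write a whole document.
tree = ET.ElementTree(root)
buffer = io.BytesIO()
tree.write(buffer, encoding="utf-8", xml_declaration=True)

# Parsing the serialized bytes gives back an equivalent tree.
reparsed = ET.parse(io.BytesIO(buffer.getvalue()))
print(type(tree).__name__, type(root).__name__)
print(reparsed.getroot().find("channel/title").text)
```

An in-memory buffer stands in for a file here so the sketch leaves nothing on disk; passing a filename to write() and parse() works the same way.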

Now let's go through the parseXML() function:

tree = ET.parse(xmlfile)

Here, an ElementTree object is created by parsing the file passed in as xmlfile.

root = tree.getroot()

The getroot() function returns the root of the tree as an Element object.
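A quick way to get oriented after getroot() is to inspect the root element. In this sketch a small byte string stands in for topnewsfeed.xml, so its contents are an assumption rather than the real feed.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for topnewsfeed.xml.
xml_bytes = (
    b'<rss version="2.0"><channel>'
    b"<title>Example</title><item/><item/>"
    b"</channel></rss>"
)

# parse() accepts either a filename or a file object.
tree = ET.parse(io.BytesIO(xml_bytes))
root = tree.getroot()

print(root.tag)                        # name of the root element
print(root.attrib)                     # its attributes as a dict
print([child.tag for child in root])   # direct children only
print(len(root[0]))                    # number of children of <channel>
```

Iterating over an element or indexing it only reaches its direct children, which is why a path expression is used next to reach the item elements.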

for item in root.findall('./channel/item'):

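Looking at the structure of the XML file, the only elements of interest are the item elements. ./channel/item is XPath syntax (XPath is a language for addressing parts of an XML document). Here, it finds all item grandchildren of the root element (denoted by '.') that sit under its channel children. ElementTree supports only a limited subset of XPath.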
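The path expression can be compared with a few related ones on a small, made-up feed fragment. The element names below are assumptions chosen to show where the expressions differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment for illustration.
root = ET.fromstring(
    "<rss><channel>"
    "<item><title>A</title></item>"
    "<item><title>B</title></item>"
    "<image><title>Logo</title></image>"
    "</channel></rss>"
)

explicit = root.findall("./channel/item")           # items directly under channel
anywhere = root.findall(".//item")                  # items at any depth
item_titles = root.findall("./channel/item/title")  # titles of items only
all_titles = root.findall(".//title")               # also matches image/title
second = root.find("./channel/item[2]/title")       # positions are 1-based

print(len(explicit), len(anywhere), len(item_titles), len(all_titles), second.text)
```

Spelling out the full path, as the article's code does, avoids picking up same-named elements elsewhere in the document.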

for item in root.findall('./channel/item'):
    # empty news dictionary
    news = {}

    # iterate child elements of item
    for child in item:
        # special checking for the namespaced media:content tag
        if child.tag == '{http://search.yahoo.com/mrss/}content':
            news['media'] = child.attrib['url']
        else:
            news[child.tag] = child.text

    # append news dictionary to news items list
    newsitems.append(news)

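The loop iterates over the item elements, each of which contains one news story. An empty news dictionary is created for each item to hold all the data available about it. To visit each child element of an element, simply iterate over it: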

for child in item:

if child.tag == '{http://search.yahoo.com/mrss/}content':
    news['media'] = child.attrib['url']

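Namespaced tags need this separate check because the parser expands the prefix to the full namespace URI, so media:content appears as {http://search.yahoo.com/mrss/}content. child.attrib is a dictionary of all the attributes of an element; here, only the url attribute of the media:content tag is of interest.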
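The expansion of namespaced tags can be seen directly on a small example. The item below is hand-written for illustration, and the prefix-map lookup is an alternative to the article's string comparison, not what its code does.

```python
import xml.etree.ElementTree as ET

# Hypothetical item; the URI is the namespace that Media RSS declares.
item = ET.fromstring(
    '<item xmlns:media="http://search.yahoo.com/mrss/">'
    "<title>Story</title>"
    '<media:content url="http://example.com/pic.jpg"/>'
    "</item>"
)

# After parsing, the prefix is replaced by {namespace-uri}localname.
tags = [child.tag for child in item]
print(tags)

# A prefix map lets find() use the short form instead of the expanded tag.
ns = {"media": "http://search.yahoo.com/mrss/"}
media = item.find("media:content", ns)
print(media.get("url"))
```

The prefix map keeps the long URI in one place, which may be easier to maintain when several namespaced tags need to be read.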

news[child.tag] = child.text

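This line handles every other child: child.tag contains the name of the child element, and child.text holds the text inside it. A sample item element, once converted to a dictionary, looks like this: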

{'description': 'Ignis has a tough competition already from Hyun.... ,
 'guid': 'http://www.hindustantimes.com/autos/maruti-ignis-launch.... ,
 'link': 'http://www.hindustantimes.com/autos/maruti-ignis-launch.... ,
 'media': 'http://www.hindustantimes.com/rf/image_size_630x354/HT/... ,
 'pubDate': 'Thu, 12 Jan 2017 12:33:04 GMT ',
 'title': 'Maruti Ignis launches on Jan 13: Five cars that threa..... }

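This dictionary is then appended to the newsitems list, and the list is returned once every item has been processed.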
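One detail the conversion glosses over is that child.text is None for an empty element, and attrib['url'] raises KeyError if the attribute is absent. A more defensive version of the per-item step might look like the sketch below; item_to_dict and MEDIA_CONTENT are hypothetical names, and the sample item is invented.

```python
import xml.etree.ElementTree as ET

MEDIA_CONTENT = "{http://search.yahoo.com/mrss/}content"


def item_to_dict(item):
    """Convert one <item> element to a flat dict (hypothetical helper)."""
    news = {}
    for child in item:
        if child.tag == MEDIA_CONTENT:
            # .get() returns None instead of raising if url is missing
            news["media"] = child.get("url")
        else:
            # child.text is None for empty elements such as <description/>
            news[child.tag] = (child.text or "").strip()
    return news


sample = ET.fromstring(
    '<item xmlns:media="http://search.yahoo.com/mrss/">'
    "<title> Story </title><description/>"
    '<media:content url="http://example.com/pic.jpg"/>'
    "</item>"
)
print(item_to_dict(sample))
```

Stripping whitespace is a design choice made here for tidier CSV cells; the article's code keeps the text exactly as it appears in the feed.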

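Finally, the savetoCSV() function saves the list of news items to a CSV file so that the data can be used or modified easily in the future.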
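csv.DictWriter raises ValueError when a row contains a key that is not in fieldnames, and fills missing keys with an empty string by default. The sketch below shows both cases with made-up rows; writing to an in-memory buffer and setting extrasaction="ignore" are choices made for this illustration, not part of the article's code.

```python
import csv
import io

fields = ["guid", "title", "pubDate", "description", "link", "media"]

# Hypothetical rows: the first lacks several fields, the second has an
# extra 'author' key that is not in fields.
rows = [
    {"guid": "1", "title": "A", "link": "http://example.com/1"},
    {"guid": "2", "title": "B", "author": "someone",
     "media": "http://example.com/b.jpg"},
]

# StringIO stands in for open(filename, "w", newline="", encoding="utf-8").
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fields, restval="", extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Ignoring extra keys makes the export tolerant of feeds that carry more tags than the six columns chosen here, at the cost of silently dropping that data.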

The formatted data now sits in topnews.csv, with one row per news item and one column per field.

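As you can see, the hierarchical XML data has been converted to a simple CSV file in which all news stories are stored as a table. This also makes it easier to load the data into a database later. The JSON-like list of dictionaries can also be used directly in an application. For websites that do not provide a public API but do publish RSS feeds, this is a practical way to extract data. All the code and files used in this article can be found here.

What's next?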

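You can look at more RSS feeds from the news website used in the example above, and try to build an extended version of the example by parsing those feeds as well.
Are you a cricket fan? Then this RSS feed should interest you! You can parse this XML file to scrape information about live cricket matches and use it to build a desktop notifier.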
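Another possible next step is to serialize the list of dictionaries to JSON so an application can consume it directly. The list below is a hypothetical stand-in for what parseXML() returns.

```python
import json

# Hypothetical stand-in for the list returned by parseXML().
newsitems = [
    {"title": "A", "link": "http://example.com/1"},
    {"title": "B", "link": "http://example.com/2"},
]

# ensure_ascii=False keeps non-ASCII headlines readable in the output.
payload = json.dumps(newsitems, ensure_ascii=False, indent=2)

# Loading the payload back yields the same list of dictionaries.
restored = json.loads(payload)
print(restored == newsitems)
```

This works without any conversion step because every value in the dictionaries is a plain string.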
