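XPath

Introduction

  • XPath (XML Path Language) is a query language for XML that locates nodes in the XML tree structure. It navigates an XML document by way of its elements and attributes.

  • XML is a text format built on markup, and XPath makes it easy to locate elements in an XML document and the values of their attributes. lxml is a third-party Python module that can parse HTML text into an element tree and evaluate XPath expressions against that tree.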

Node relationships

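xml_content = '''
<bookstore>
<book>
<title lang='eng'>Harry Potter</title>
<author>J.K. Rowling</author>
<year>2005</year>
<price>29</price>
</book>
</bookstore>
'''
  • Here <bookstore> is the root element of the document, <author>J.K. Rowling</author> is an element node, and lang='eng' is an attribute node.
  • Parent: the book element is the parent of the title, author, year, and price elements.
  • Children: title, author, year, and price are all children of the book element.
  • Siblings: title, author, year, and price are siblings of one another.
  • Ancestors: the ancestors of the title element are the book element and the bookstore element.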
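
The relationships listed above can be checked directly with lxml. This is a minimal sketch, assuming the sample document is well-formed (closing </price> tag in place); the variable names are illustrative only.

```python
from lxml import etree

# Same sample document as above, assumed well-formed.
xml_content = """
<bookstore>
<book>
<title lang='eng'>Harry Potter</title>
<author>J.K. Rowling</author>
<year>2005</year>
<price>29</price>
</book>
</bookstore>
"""

root = etree.fromstring(xml_content)   # the <bookstore> element
title = root.xpath("//title")[0]

parent = title.getparent().tag                                  # parent of <title>
children = [el.tag for el in root.xpath("/bookstore/book/*")]   # children of <book>
siblings = [el.tag for el in title.xpath("following-sibling::*")]  # siblings after <title>
ancestors = [el.tag for el in title.xpath("ancestor::*")]       # ancestors of <title>

print(parent)     # book
print(children)   # ['title', 'author', 'year', 'price']
print(siblings)   # ['author', 'year', 'price']
print(ancestors)  # ['bookstore', 'book']
```

The ancestor:: and following-sibling:: axes are standard XPath 1.0; getparent() is the lxml shortcut for the same parent lookup.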

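Basic path expressions

  • nodename  selects the child elements of the current node that are named nodename
  • /  selects starting from the root node
  • //  selects matching nodes anywhere in the document, regardless of their position
  • .  selects the current node
  • ..  selects the parent of the current node
  • @  selects attributes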
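
Each of the expressions above can be tried against a small tree. The following is a sketch only, using a shortened version of the bookstore document; the variable names are assumptions made for illustration.

```python
from lxml import etree

root = etree.fromstring(
    "<bookstore><book>"
    "<title lang='eng'>Harry Potter</title><price>29</price>"
    "</book></bookstore>"
)

books = root.xpath("book")                 # nodename: children of <bookstore> named book
from_root = root.xpath("/bookstore/book")  # / : absolute path starting at the root
anywhere = root.xpath("//title")           # //: matches at any depth
title = anywhere[0]
current = title.xpath(".")                 # . : the current node itself
parent = title.xpath("..")                 # ..: the parent of the current node
lang = title.xpath("@lang")                # @ : an attribute of the current node

print([el.tag for el in books])      # ['book']
print([el.tag for el in from_root])  # ['book']
print(current[0].tag)                # title
print(parent[0].tag)                 # book
print(lang)                          # ['eng']
```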

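Finding specific nodes

  • /bookstore/book[1] selects the first book element that is a child of bookstore
  • /bookstore/book[last()] selects the last book element that is a child of bookstore
  • /bookstore/book[last()-1] selects the second-to-last book element that is a child of bookstore
  • /bookstore/book[position()<3] selects the first two book elements that are children of bookstore
  • //title[@lang] selects all title elements that have a lang attribute
  • //title[@lang='eng'] selects all title elements that have a lang attribute with the value eng
  • /bookstore/book[price>35] selects all book elements under bookstore whose price element has a value greater than 35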
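
To see the predicates above in action, here is a small sketch. The three-book document and the titles() helper are made up for illustration and are not part of the original notes.

```python
from lxml import etree

# Hypothetical three-book document; the third title has no lang attribute.
root = etree.fromstring("""
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title>Untitled Draft</title><price>45</price></book>
</bookstore>
""")

def titles(expr):
    """Return the title text of every book element matched by expr."""
    return [book.findtext("title") for book in root.xpath(expr)]

print(titles("/bookstore/book[1]"))               # ['Harry Potter']
print(titles("/bookstore/book[last()]"))          # ['Untitled Draft']
print(titles("/bookstore/book[last()-1]"))        # ['Learning XML']
print(titles("/bookstore/book[position()<3]"))    # ['Harry Potter', 'Learning XML']
print(titles("/bookstore/book[price>35]"))        # ['Learning XML', 'Untitled Draft']
print(root.xpath("//title[@lang]/text()"))        # ['Harry Potter', 'Learning XML']
print(root.xpath("//title[@lang='eng']/text()"))  # ['Harry Potter', 'Learning XML']
```

Note that XPath positions start at 1, not 0, and that price is compared as a child element name, so it takes no parentheses.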

Using XPath with lxml

Converting HTML text to an element tree

from lxml import etree
wb_data = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
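wb = etree.HTML(wb_data)  # parse the HTML text; lxml also repairs the unclosed final <li>
print(wb)  # <Element html at 0x22bbbd36888> (the address varies between runs)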

Returning the tree as text

from lxml import etree
wb_data = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
wb = etree.HTML(wb_data)
result = etree.tostring(wb)  # serialize the tree; the result is a bytes object
r = result.decode('utf-8')   # decode the bytes into text
print(r, type(r))  # r is a str
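
As a variation on the listing above, tostring() can return a str directly, which avoids the separate decode step. This is a minimal sketch assuming a one-item fragment; encoding="unicode" and method="html" are existing keyword arguments of lxml's tostring().

```python
from lxml import etree

# A fragment whose <li> is never closed, like the last item in the listing above.
fragment = '<ul><li class="item-0"><a href="link5.html">fifth item</a></ul>'
tree = etree.HTML(fragment)

as_bytes = etree.tostring(tree)                                   # default: bytes
as_text = etree.tostring(tree, encoding="unicode", method="html")  # str, HTML serialization

print(type(as_bytes).__name__)  # bytes
print(type(as_text).__name__)   # str
print(as_text)                  # the fragment wrapped in <html><body>, with </li> added
```

The printed markup shows that etree.HTML() wraps the fragment in html and body elements and closes the li tag that the input left open.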

Getting the href of the a tags under li

from lxml import etree
wb_data = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
wb = etree.HTML(wb_data)
# Get the href of each a tag under an li tag
link = wb.xpath('//li/a/@href')
print(link, type(link))  # ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html'] <class 'list'>
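
The same attribute selection can be narrowed with a predicate on the li element. The sketch below uses a four-item version of the list; the variable names are illustrative.

```python
from lxml import etree

wb_data = """
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>
"""
wb = etree.HTML(wb_data)

# Only li elements whose class is exactly "item-1"
exact = wb.xpath('//li[@class="item-1"]/a/@href')
# Only li elements whose class starts with "item-i"
prefix = wb.xpath('//li[starts-with(@class, "item-i")]/a/@href')
# Only the first li under its parent
first = wb.xpath('//li[1]/a/@href')

print(exact)   # ['link2.html', 'link4.html']
print(prefix)  # ['link3.html']
print(first)   # ['link1.html']
```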

Getting the text of the a tags

from lxml import etree

wb_data = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
wb = etree.HTML(wb_data)
# Get the text of each a tag under an li tag
s = wb.xpath('//li/a/text()')
print(s, type(s))  # ['first item', 'second item', 'third item', 'fourth item', 'fifth item'] <class 'list'>
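
One caveat worth illustrating: text() returns only the text nodes that are direct children of the matched element. The sketch below assumes a link with nested markup, which the original sample does not contain.

```python
from lxml import etree

# Hypothetical link whose text is partly wrapped in a <b> element.
wb = etree.HTML('<ul><li><a href="link1.html"><b>first</b> item</a></li></ul>')

direct = wb.xpath('//li/a/text()')   # text nodes that are direct children of <a>
nested = wb.xpath('//li/a//text()')  # text nodes at any depth below <a>
joined = wb.xpath('string(//li/a)')  # all text of the first matching <a>, concatenated

print(direct)  # [' item']
print(nested)  # ['first', ' item']
print(joined)  # first item
```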

Storing the extracted data in dictionaries

from lxml import etree
wb_data = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
wb = etree.HTML(wb_data)
# Get the href of each a tag under an li tag
link = wb.xpath('//li/a/@href')
# Get the text of each a tag under an li tag
s = wb.xpath('//li/a/text()')
# Pair each href with the text at the same position and store the pair in a dict.
# zip() avoids link.index(), which returns the wrong position when an href repeats.
for href, title in zip(link, s):
    d = {'href': href, 'title': title}
    print(d)
# {'href': 'link1.html', 'title': 'first item'}
# {'href': 'link2.html', 'title': 'second item'}
# {'href': 'link3.html', 'title': 'third item'}
# {'href': 'link4.html', 'title': 'fourth item'}
# {'href': 'link5.html', 'title': 'fifth item'}
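
Collecting hrefs and texts in two separate lists only lines up when every a tag has both. An alternative sketch, not the approach used in the listing above, is to select the a elements themselves and read both values from each one; the sample markup with a missing href is an assumption made for illustration.

```python
from lxml import etree

# Hypothetical list in which the second link has no href attribute.
wb = etree.HTML("""
<ul>
<li><a href="link1.html">first item</a></li>
<li><a>no link here</a></li>
<li><a href="link3.html">third item</a></li>
</ul>
""")

# One dict per <a> element; get() returns None when the attribute is absent.
records = [
    {"href": a.get("href"), "title": a.text}
    for a in wb.xpath("//li/a")
]

for record in records:
    print(record)
# {'href': 'link1.html', 'title': 'first item'}
# {'href': None, 'title': 'no link here'}
# {'href': 'link3.html', 'title': 'third item'}
```

Reading both fields from the same element keeps each href attached to its own text even when some links are incomplete.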