
Cleaning XML with Python lxml: removing nodes and attributes


Introduction:

Data migration projects need a way to confirm that the new data is consistent with the old data. Sometimes that means testing the content itself, which here involves comparing XML. Because the data-processing method has changed, some differences are expected and should be ignored, so the nodes and attributes at particular XPaths have to be removed before the comparison. After looking at several options, I found lxml to be the most efficient, and it makes XML handling very convenient. This article works through an example to solve this data-cleaning problem.
Outline:

  1. XML namespaces in brief
  2. lxml operations on XML
  3. Cleaning XML with lxml

XML namespaces

For background on XML namespaces, you can refer to XML Namespace. Namespaces mainly exist to resolve naming conflicts. Tags in complex XML documents usually carry a prefix, and the prefix stands for the namespace.

For example, this is an ordinary XML document that declares a default namespace:

<schema xmlns="http://www.w3.org/2001/XMLSchema" targetNamespace="urn:B" xmlns:B="urn:B" elementFormDefault="qualified">
  <element name="foo">
    <complexType>
      <element name="bar" type="B:myType"/>
    </complexType>
  </element>
  <complexType name="myType">
    <choice>
      <element name="baz" type="string"/>
      <element name="bas" type="string"/>
    </choice>
  </complexType>
</schema>

The same document can also be written with an explicit prefix, declared as
xmlns:myprefix="http://www.w3.org/2001/XMLSchema"

<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" targetNamespace="urn:B" xmlns:B="urn:B" elementFormDefault="qualified">
  <myprefix:element name="foo">
    <myprefix:complexType>
      <myprefix:element name="bar" type="B:myType"/>
    </myprefix:complexType>
  </myprefix:element>
  <myprefix:complexType name="myType">
    <myprefix:choice>
      <myprefix:element name="baz" type="string"/>
      <myprefix:element name="bas" type="string"/>
    </myprefix:choice>
  </myprefix:complexType>
</myprefix:schema>
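
To see why the two spellings are equivalent, it helps to look at how lxml stores tag names. The following is a minimal sketch, assuming lxml is installed; the two short documents are shortened stand-ins for the schemas above, not part of the original example.

```python
import lxml.etree as XE

# Shortened stand-ins for the two documents above (illustrative only).
default_ns = '<schema xmlns="http://www.w3.org/2001/XMLSchema"><element name="foo"/></schema>'
prefixed = ('<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema">'
            '<myprefix:element name="foo"/></myprefix:schema>')

root_a = XE.fromstring(default_ns)
root_b = XE.fromstring(prefixed)

# lxml stores each tag as "{namespace-uri}local-name", so the prefix itself does not matter.
print(root_a.tag)                # {http://www.w3.org/2001/XMLSchema}schema
print(root_a.tag == root_b.tag)  # True

# The prefix-to-URI mapping is kept separately in nsmap.
print(root_a.nsmap)  # {None: 'http://www.w3.org/2001/XMLSchema'}
print(root_b.nsmap)  # {'myprefix': 'http://www.w3.org/2001/XMLSchema'}
```

Because the prefix lives only in nsmap, any path that uses a prefix has to be resolved against that mapping.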

lxml operations on XML

  1. Parse the XML:
import lxml.etree as XE
root = XE.fromstring(xml_content)
  2. Get the namespaces:
namespaces = root.nsmap
  3. Locate nodes:
    Note: use a relative path here; findall does not support absolute paths. For prefixed tags you must also pass the namespace map.
nodes = root.findall(rele_xpath_ignore, namespaces=root.nsmap)
  4. Remove a node:
node.getparent().remove(node)
  5. Remove an attribute:
node.attrib.pop(attri_name_list[0])
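
The five operations can be tried together on a small document. This is only a sketch under assumptions: the `p:` prefix, the `urn:example` namespace and the element names are made up for illustration, and lxml is assumed to be installed.

```python
import lxml.etree as XE

# Hypothetical sample document; names and namespace are illustrative.
xml_content = (
    '<p:root xmlns:p="urn:example">'
    '<p:item id="1" note="x"/>'
    '<p:item id="2"/>'
    '<p:other/>'
    '</p:root>'
)

root = XE.fromstring(xml_content)   # 1. parse
namespaces = root.nsmap             # 2. {'p': 'urn:example'}

# 3. locate nodes with a relative path; the prefix is resolved through namespaces
items = root.findall(".//p:item", namespaces=namespaces)
print(len(items))  # 2

# An absolute path on an element is rejected by findall.
try:
    root.findall("/p:root/p:item", namespaces=namespaces)
except SyntaxError as exc:
    print("absolute path rejected:", exc)

# 4. remove a node through its parent
other = root.find("p:other", namespaces=namespaces)
other.getparent().remove(other)

# 5. remove an attribute; passing a default avoids KeyError when it is absent
for item in items:
    item.attrib.pop("note", None)

print(XE.tostring(root, encoding="unicode"))
# <p:root xmlns:p="urn:example"><p:item id="1"/><p:item id="2"/></p:root>
```

Passing `None` as the default to `pop` is a small safety choice for elements that never had the attribute.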

Cleaning XML with lxml

Requirements:
Using the prefixed XML above as the example:

  1. Remove the node myprefix:schema/myprefix:complexType/myprefix:choice
  2. Remove the attribute
    myprefix:schema/myprefix:element/[@name]

Approach:

  1. Put the XPaths to be handled into a list, or read them from a file.
  2. Many nodes may match a single XPath, so process the matches in a loop.
  3. Both nodes and attributes have to be handled.
  4. When many XML documents are processed, each one may need different XPaths, so skip the XPaths that do not match the current document.
  5. To get plain XML content, remove the namespace declarations as well.
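
Points 1 to 4 can be sketched as a small planning step that runs before any XML is touched. The helper name `plan_xpath`, the third path in the list and the `document_root` variable are hypothetical and exist only for this illustration.

```python
import re

def plan_xpath(xpath):
    """Split an ignore XPath into (root local name, relative path, attribute name or None)."""
    root_step, _, rest = xpath.partition("/")
    root_name = root_step.split(":")[-1]        # drop the prefix, if any
    match = re.search(r"\[@([^\]]+)\]", xpath)  # "[@name]" marks an attribute target
    attribute = match.group(1) if match else None
    return root_name, ".//" + rest, attribute

ignore_xpaths = [
    "myprefix:schema/myprefix:complexType/myprefix:choice",
    "myprefix:schema/myprefix:element/[@name]",
    "myprefix:other/myprefix:thing",  # hypothetical path meant for a different document
]

document_root = "schema"  # local name of the current document's root element
for xpath in ignore_xpaths:
    root_name, relative, attribute = plan_xpath(xpath)
    if root_name != document_root:
        continue  # point 4: skip XPaths that belong to other documents
    action = "remove attribute " + attribute if attribute else "remove node"
    print(relative, "->", action)
# .//myprefix:complexType/myprefix:choice -> remove node
# .//myprefix:element/[@name] -> remove attribute name
```

Separating the path parsing from the tree edits keeps the per-document loop short and makes the skip rule easy to check on its own.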

Complete code:

import re
import lxml.etree as XE


def ignore_xpath_handled_by_lxml(xml_content):
    # XPaths to ignore: the first targets a node, the second an attribute ([@name]).
    ignore_xpath_set = set()
    ignore_xpath_set.add("myprefix:schema/myprefix:complexType/myprefix:choice")
    ignore_xpath_set.add("myprefix:schema/myprefix:element/[@name]")
    root = XE.fromstring(xml_content)
    # root.tag looks like "{namespace-uri}schema"; keep only the local name.
    root_tag_name = re.findall(r".*\}(.*)", root.tag)[0]
    for xpath_ignore in ignore_xpath_set:
        xpath_ignore_tag = xpath_ignore.split("/")[0].split(":")[1]
        # Relative path: replace the leading root step with ".//".
        index = xpath_ignore.find("/")
        rele_xpath_ignore = ".//" + xpath_ignore[index + 1:]
        # Skip XPaths whose root tag does not match this document.
        if xpath_ignore_tag != root_tag_name:
            continue
        try:
            # A "[@attr]" part means: remove that attribute, not the node.
            attri_name_list = re.findall(r".*\[@(.*)\].*", xpath_ignore)
            nodes = root.findall(rele_xpath_ignore, namespaces=root.nsmap)
            for node in nodes:
                if len(attri_name_list) > 0:
                    node.attrib.pop(attri_name_list[0])
                else:
                    node.getparent().remove(node)
        except Exception as e:
            print("Error: {}".format(e))
    root_tag = "myprefix" + ":" + root_tag_name
    ignore_result = XE.tostring(root, pretty_print=True, encoding="unicode")
    # Replace the root start tag (namespace declarations and the attributes after
    # them) with a bare tag. [^>]* stops at the end of the start tag, so the
    # match cannot run on into the rest of the document.
    namespace_pattern = re.compile('<' + re.escape(root_tag) + r' xmlns[^>]*>')
    content_without_namespace = re.sub(namespace_pattern, '<' + root_tag + '>', ignore_result)
    return content_without_namespace


if __name__ == "__main__":
    xml_string = '''<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" targetNamespace="urn:B" xmlns:B="urn:B" elementFormDefault="qualified">
<myprefix:element name="foo">
<myprefix:complexType>
<myprefix:element name="bar" type="B:myType"/>
</myprefix:complexType>
</myprefix:element>
<myprefix:complexType name="myType">
<myprefix:choice>
<myprefix:element name="baz" type="string"/>
<myprefix:element name="bas" type="string"/>
</myprefix:choice>
</myprefix:complexType>
</myprefix:schema>'''
    new_xml_string = ignore_xpath_handled_by_lxml(xml_string)
    print(new_xml_string)

Output:

<myprefix:schema>
<myprefix:element>
<myprefix:complexType>
<myprefix:element type="B:myType"/>
</myprefix:complexType>
</myprefix:element>
<myprefix:complexType name="myType">
</myprefix:complexType>
</myprefix:schema>
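
The complete code clears the namespace with a regular expression on the serialized text, which keeps the prefix and drops every attribute of the root tag. As an alternative, and not the method used above, namespaces can be removed on the tree itself. The sketch below assumes lxml is installed; the helper name `strip_namespaces` and the sample document are hypothetical. It does not handle namespaced attributes.

```python
import lxml.etree as XE

def strip_namespaces(root):
    """Rewrite every element tag to its local name, then drop unused xmlns declarations."""
    for element in root.iter():
        if isinstance(element.tag, str):  # skip comments and processing instructions
            element.tag = XE.QName(element).localname
    XE.cleanup_namespaces(root)
    return root

# Hypothetical sample document for illustration.
root = XE.fromstring('<p:a xmlns:p="urn:example"><p:b k="v"/></p:a>')
print(XE.tostring(strip_namespaces(root), encoding="unicode"))
# <a><b k="v"/></a>
```

Unlike the regular expression, this keeps the remaining root attributes and yields well-formed XML without prefixes, so which variant fits depends on what the comparison expects.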
