Basic Usage of Beautiful Soup#
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
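The two lines above assume that a variable named html already holds the markup. A minimal runnable sketch follows; the sample markup is hypothetical, and it uses the standard-library 'html.parser' instead of 'lxml' so that it runs without installing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 'html.parser' ships with Python; pass 'lxml' instead if lxml is installed.
soup = BeautifulSoup(html, "html.parser")

# Access the first <title> tag by name and read its text.
title_text = soup.title.string
print(title_text)
```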
Node Selector#
To get the text inside a node, access the node by its tag name and then read its string attribute.
soup.a.string ---> Get the text of the first <a> tag, returns a string
soup.a ---> Get the HTML code of the first <a> tag (Tag type), can be nested
soup.a.attrs['class'] ---> Get an attribute through the attrs dictionary
soup.a['class'] ---> Shorthand for the line above; multi-valued attributes such as class return a list, single-valued attributes such as id return a string
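The node selectors above can be tried on a small fragment. This is a sketch under stated assumptions: the markup and variable names are hypothetical, and 'html.parser' is used in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links; only the first is reached via soup.a.
html = '<p><a class="link external" id="first" href="/one">One</a><a href="/two">Two</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                  # Tag object for the first <a>
link_text = first_link.string        # text inside the tag
link_classes = first_link["class"]   # class is multi-valued, so this is a list
link_id = first_link["id"]           # id is single-valued, so this is a string

# Indexing a missing attribute raises KeyError; get() returns None instead.
missing = first_link.get("title")
```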
Related Selector#
- Child nodes
soup.a.contents ---> Returns a list
soup.a.children ---> Returns a generator
- Descendant nodes
soup.a.descendants ---> Returns a generator
- Parent node
soup.a.parent ---> Returns the parent node of the first <a> node
- Ancestor nodes
soup.a.parents ---> Returns a generator
- Sibling nodes
soup.a.next_sibling ---> Returns the next sibling node (this may be a text node, such as the whitespace between tags)
soup.a.previous_sibling ---> Returns the previous sibling node (this may also be a text node)
soup.a.next_siblings ---> Returns the following sibling nodes, generator type
soup.a.previous_siblings ---> Returns the preceding sibling nodes, generator type
When one of the attributes above returns a single node, you can directly read attributes such as string and attrs to get its text and attribute content.
When the result is a generator of multiple nodes, convert it to a list, pick a specific element, and then read attributes such as string and attrs.
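The navigation attributes above can be sketched on a short list. The markup is hypothetical and deliberately contains no whitespace between tags, so every sibling is a Tag rather than a text node; 'html.parser' is again an assumption made to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical list with no whitespace between the tags.
html = '<ul id="menu"><li>A</li><li>B</li><li>C</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_li = soup.li

# Single-node results: read attributes directly.
parent_id = first_li.parent["id"]
next_text = first_li.next_sibling.string

# Generator results: convert to a list first, then index into it.
following = list(first_li.next_siblings)
last_text = following[-1].string

children = soup.ul.contents                      # already a list
ancestor_names = [p.name for p in first_li.parents]
```

With markup that is indented across several lines, next_sibling would typically return the whitespace string between two tags, which is why converting to a list and inspecting the elements is often the safer habit.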
Method Selector#
find_all(name, attrs, recursive, text, **kwargs): Find all elements that meet the conditions
name: Query based on the node name, returns a list of Tag objects
attrs: Query based on attributes, returns a list of Tag objects
soup.find_all(attrs={'id':'list-1'})
soup.find_all(id='list-1')
soup.find_all(class_='element') ---> 'class' is a Python keyword, so add '_'
text: Matches the text of a node; the value can be a string or a regular expression object. Newer Beautiful Soup releases name this parameter string, and text is the older spelling.
soup.find_all(text=re.compile('link')) ---> Returns a list of all node texts that match the regular expression (requires import re)
find(name, attrs, recursive, text, **kwargs): Find the first element that meets the conditions; usage is the same as find_all(), but it returns a single Tag, or None if nothing matches
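The method selectors can be compared side by side on one fragment. This is a sketch, not a definitive reference: the markup is hypothetical, 'html.parser' stands in for 'lxml', and the string= keyword assumes Beautiful Soup 4.4.0 or later (older versions use text=).

```python
import re

from bs4 import BeautifulSoup

# Hypothetical list reusing the id and class names from the notes above.
html = (
    '<ul id="list-1">'
    '<li class="element">first link</li>'
    '<li class="element">second link</li>'
    '<li>plain</li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

# Filter by tag name and class; class_ avoids the Python keyword.
items = soup.find_all("li", class_="element")

# Two equivalent ways to filter by an attribute.
by_kwarg = soup.find_all(id="list-1")
by_attrs = soup.find_all(attrs={"id": "list-1"})

# Filtering by text returns the matching strings, not the tags.
texts = soup.find_all(string=re.compile("link"))

# find() returns the first match, or None when nothing matches.
first_item = soup.find("li")
no_match = soup.find("table")
```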
CSS Selector#
Call the select() method and pass the corresponding CSS selector: soup.select('CSS selector statement') returns a list whose elements are Tag objects
1. Support nested selection
for ul in soup.select('ul'):
    print(ul.select('li'))
2. Get attributes
for ul in soup.select('ul'):
    content = ul['id']
    content = ul.attrs['id']
3. Get text
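for ul in soup.select('ul'):
    content = ul.string ---> Get the text when the node has a single child; returns None when it has several children
    content = ul.get_text() ---> Get all text inside the node, including text in descendants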
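The three CSS selector patterns above can be combined into one runnable sketch. The markup and variable names are hypothetical, 'html.parser' replaces 'lxml', and select() is assumed to be backed by the soupsieve package that recent Beautiful Soup releases install as a dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two lists of different lengths.
html = (
    '<ul id="list-1"><li>Foo</li><li>Bar</li></ul>'
    '<ul id="list-2"><li>Baz</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

lists = soup.select("ul")

# 1. Nested selection: call select() again on each result.
item_counts = [len(ul.select("li")) for ul in lists]

# 2. Attributes: index the Tag or go through attrs.
ids = [ul["id"] for ul in lists]

# 3. Text: string only works when there is a single child to descend into.
two_children = lists[0].string      # None, because <ul> has two <li> children
one_child = lists[1].string         # text of the only <li>
all_text = lists[0].get_text()      # concatenated text of all descendants

# select_one() returns the first match instead of a list.
first_li_text = soup.select_one("#list-1 li").string
```

Because string silently returns None for nodes with several children, get_text() is usually the more predictable choice when the structure of the node is not known in advance.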