BeautifulSoup Study Notes

Basic Usage of Beautiful Soup

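from bs4 import BeautifulSoup

html = '<p>Hello</p>'   # any HTML string, e.g. the body of a downloaded page
soup = BeautifulSoup(html, 'lxml')   # 'lxml' needs the lxml package installed; 'html.parser' is built in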

Node Selector

To get the text inside a node, access the node by its tag name as an attribute of soup, then read its string attribute.

soup.a.string   ---> Get the text of the first <a> tag; returns a NavigableString (a str subclass), or None when the tag has more than one child
soup.a    ---> Get the first <a> tag (Tag type); selections can be chained, e.g. soup.head.title
soup.a.attrs['class']
soup.a['class'] ---> Get an attribute value; multi-valued attributes such as class return a list, single-valued attributes such as id or href return a string
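
A small sketch of the node selectors above. The HTML snippet and variable names are made up for illustration, and 'html.parser' is used instead of 'lxml' only so the example has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real page
html = '<p id="intro"><a class="link external" id="first" href="/one">One</a><a href="/two">Two</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                     # first <a> tag in the document (Tag)
text = first_link.string                # text of that tag
classes = first_link["class"]           # class is multi-valued, so this is a list
link_id = first_link["id"]              # id is single-valued, so this is a string
same_classes = first_link.attrs["class"]  # attrs is the underlying dict

paragraph_text = soup.p.string          # <p> has two children, so .string is None
missing = first_link.get("title")       # .get() returns None instead of raising KeyError

print(text, classes, link_id, paragraph_text, missing)
```

Note that soup.a['title'] would raise KeyError here, which is why get() is often the safer choice when an attribute may be absent.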

Child nodes

soup.a.contents   ---> Returns a list of the direct children (tags and text nodes)
soup.a.children  ---> Returns an iterator over the same direct children

Descendant nodes

soup.a.descendants  ---> Returns a generator over all descendants (children, grandchildren, and so on)
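
To make the difference between children and descendants concrete, here is a minimal sketch. The markup is invented for the example and 'html.parser' is an assumption chosen to avoid extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup: one <a> containing a text node and a nested <b>
html = '<p><a href="/one">One <b>bold</b></a></p>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

direct = link.contents                   # list of direct children: text node + <b>
child_tags = [c.name for c in link.children if isinstance(c, Tag)]  # tags only
all_nodes = list(link.descendants)       # text node, <b>, and the text inside <b>

print(len(direct), child_tags, len(all_nodes))
```

Children stop at the first level, while descendants also walk into the nested <b> tag, which is why the second count is larger.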

Parent node

soup.a.parent  ---> Returns the parent node of the first <a> node

Ancestor nodes

soup.a.parents  ---> Returns a generator over all ancestors, from the parent up to the BeautifulSoup object itself
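
A short sketch of parent versus parents, using invented markup. With 'html.parser' (an assumption here) no <html> or <body> wrappers are added, so the ancestor chain is short.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = '<div><p><a href="/one">One</a></p></div>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

parent_name = link.parent.name                     # the direct parent tag
ancestor_names = [node.name for node in link.parents]  # walks up to the soup object

print(parent_name, ancestor_names)
```

The last ancestor is the BeautifulSoup object, whose name is the special value '[document]'.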

Sibling nodes

soup.a.next_sibling  ---> Returns the next sibling node (may be a text node such as whitespace, or None if there is none)
soup.a.previous_sibling  ---> Returns the previous sibling node (same caveats)
soup.a.next_siblings  ---> Returns a generator over all following sibling nodes
soup.a.previous_siblings  ---> Returns a generator over all preceding sibling nodes

When one of the attributes above returns a single node, you can read string, attrs, and similar attributes on it directly to get its text and attribute content.
When the result is a generator over multiple nodes, convert it to a list first, pick the element you need, and then read string, attrs, and so on from that element.
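
The sibling attributes and the generator-to-list step can be sketched as follows. The markup is made up and deliberately has no whitespace between tags, so every sibling is a Tag; 'html.parser' is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with three adjacent <a> tags
html = '<p><a id="a1">One</a><a id="a2">Two</a><a id="a3">Three</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

next_id = first.next_sibling["id"]        # single node: use it directly
before = first.previous_sibling           # nothing precedes the first <a>

following = list(first.next_siblings)     # generator -> list
last_text = following[1].string           # then index and read .string

print(next_id, before, len(following), last_text)
```

With real pages, line breaks between tags show up as whitespace text nodes among the siblings, so filtering for Tag instances is often needed.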

Method Selector

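find_all(name, attrs, recursive, string, limit, **kwargs): Find all elements that meet the conditions; returns a list-like ResultSet (the string parameter was called text before Beautiful Soup 4.4, and the old name is deprecated)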

name: Query by node name, e.g. soup.find_all('ul'); returns a list of Tag objects
attrs: Query by attributes, passed as a dict; returns a list of Tag objects

soup.find_all(attrs={'id':'list-1'})
soup.find_all(id='list-1')
soup.find_all(class_='element')  ---> 'class' is a Python keyword, so add '_'
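
A minimal sketch of the name and attribute queries above. The list markup is invented for illustration and 'html.parser' is an assumption to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = (
    '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul id="list-2"><li class="element">Baz</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("ul")                      # query by tag name
by_attrs = soup.find_all(attrs={"id": "list-1"})   # query by attrs dict
by_kwarg = soup.find_all(id="list-1")              # same query as a keyword argument
by_class = soup.find_all(class_="element")         # class_ avoids the Python keyword

item_texts = [li.string for li in by_class]
print(len(by_name), len(by_attrs), item_texts)
```

The attrs dict form is mainly useful for attribute names that are not valid Python identifiers, such as data-* attributes.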

string (called text before Beautiful Soup 4.4; the old name is deprecated): This parameter matches the text of a node; the input can be a string or a compiled regular expression object

soup.find_all(string=re.compile('link')) --> Returns a list of all text nodes (NavigableString) that match the regular expression
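
A sketch of matching on node text with a regular expression. It assumes Beautiful Soup 4.4 or later, where the parameter is named string; the markup is invented and 'html.parser' is an assumption.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = '<p><a href="/1">First link</a><a href="/2">Second link</a><span>No match</span></p>'
soup = BeautifulSoup(html, "html.parser")

# Returns the matching text nodes, not the tags that contain them
matches = soup.find_all(string=re.compile("link"))

# Use .parent to get from a text node back to its enclosing tag
parent_names = [node.parent.name for node in matches]

print(matches, parent_names)
```

The pattern is applied as a search, so it matches anywhere in the text rather than only at the start.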

find(name, attrs, recursive, string, **kwargs): Find the first element that meets the conditions; usage is the same as find_all(), but it returns a single Tag, or None if nothing matches
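
Since find() can return None, it is worth checking the result before using it. A minimal sketch with invented markup, again assuming 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li", class_="element")   # first match only
no_table = soup.find("table")                     # no <table> in the markup

# Guard against None before reading attributes
first_text = first_item.string if first_item is not None else None

print(first_text, no_table)
```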

CSS Selector

Call the select() method with a CSS selector: soup.select('CSS selector') returns a list of Tag objects.

1. Support nested selection
for ul in soup.select('ul'):
    print(ul.select('li'))
            
2. Get attributes
for ul in soup.select('ul'):
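    content = ul['id']          ---> Raises KeyError if the attribute is missing
    content = ul.attrs['id']    ---> Equivalent; ul.get('id') returns None instead of raising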

3. Get text
for ul in soup.select('ul'):
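    content = ul.string       ---> Get the text only when the node has a single child; otherwise None (so usually None for a <ul> with several <li>)
    content = ul.get_text()   ---> Get all text inside the node, including descendants, as one string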
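
The three CSS selector patterns above can be combined into one runnable sketch. The markup is invented, 'html.parser' is an assumption, and select() relies on the soupsieve package that recent Beautiful Soup releases install as a dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, written without whitespace between tags
html = (
    '<ul id="list-1"><li>Foo</li><li>Bar</li></ul>'
    '<ul id="list-2"><li>Baz</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
lists = soup.select("ul")

item_counts = [len(ul.select("li")) for ul in lists]   # 1. nested selection
ids = [ul["id"] for ul in lists]                       # 2. attributes
direct_text = [ul.string for ul in lists]              # 3a. .string
full_text = [ul.get_text() for ul in lists]            # 3b. get_text()

print(item_counts, ids, direct_text, full_text)
```

The first <ul> has two children, so its string is None, while get_text() still returns the concatenated text of both items.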