Beautiful Soup - Kinds of objects
When we pass an HTML document or string to the BeautifulSoup constructor, Beautiful Soup converts the HTML page into a tree of Python objects. Below we discuss the four major kinds of objects defined in the bs4 package.
- Tag
- NavigableString
- BeautifulSoup
- Comment
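As a quick orientation, the sketch below parses one small document and prints the class of each kind of object. The markup string is a hypothetical sample, and the built-in 'html.parser' is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, chosen so that every kind of object appears once.
markup = "<p>Hello<!-- a note --></p>"
soup = BeautifulSoup(markup, "html.parser")

p_tag = soup.p                 # the <p> element
text = p_tag.contents[0]       # the text "Hello"
comment = p_tag.contents[1]    # the HTML comment

for obj in (soup, p_tag, text, comment):
    print(type(obj).__name__)
# BeautifulSoup
# Tag
# NavigableString
# Comment
```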
Tag Object
An HTML tag defines a particular type of content. A Tag object in Beautiful Soup corresponds to an HTML or XML tag in the original page or document.
Example
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print(type(tag))
Output
<class 'bs4.element.Tag'>
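The example above reaches an html tag even though the input contains only a b tag, because the lxml parser wraps a fragment in html and body elements. The following sketch, which assumes the built-in 'html.parser' instead, shows that this parser adds no such wrapper and that the b element itself is available as soup.b.

```python
from bs4 import BeautifulSoup

fragment = '<b class="boldest">TutorialsPoint</b>'

# html.parser keeps the fragment as written: no <html> or <body> wrapper.
soup = BeautifulSoup(fragment, "html.parser")

print(soup.html)            # None, because no <html> tag was parsed
print(soup.b.name)          # b
print(soup.b)               # <b class="boldest">TutorialsPoint</b>
```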
A tag has many attributes and methods; its two most important features are its name and its attributes.
Name (tag.name)
Every tag has a name, which can be accessed with the '.name' suffix. tag.name returns the name of the tag, that is, which kind of element it is.
Example
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print(tag.name)
Output
html
If we change the tag's name, the change is reflected in the HTML markup generated by Beautiful Soup.
Example
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
tag.name = "strong"
print(tag)
Output
<strong><body><b class="boldest">TutorialsPoint</b></body></strong>
Attributes (tag.attrs)
A tag object can have any number of attributes. In the above example, the tag <b class="boldest"> has an attribute 'class' whose value is "boldest". Inside the opening tag, whatever follows the tag name is an attribute, normally written as a name and a value. The "attrs" property returns a dictionary of the attributes and their values. You can also access an individual attribute by treating the tag like a dictionary and using the attribute name as the key.
In the example below, the string argument to the BeautifulSoup() constructor contains an HTML input tag. The attributes of the input tag are returned by "attrs".
Example
from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input
print(tag.attrs)
Output
{'type': 'text', 'name': 'name', 'value': 'Raju'}
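Individual attributes can be read without going through the whole dictionary. The sketch below uses the same input tag with the built-in 'html.parser' (an assumption made only to avoid the lxml dependency) and compares key access, get() and has_attr().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', "html.parser")
tag = soup.input

print(tag["value"])         # Raju  (key access, like a dictionary)
print(tag.get("value"))     # Raju  (same value through get())
print(tag.get("id"))        # None  (missing attribute, no exception)
print(tag.has_attr("id"))   # False (membership check)
```

Key access on a missing attribute, such as tag["id"] here, raises KeyError, so get() is the safer choice when an attribute may be absent.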
We can make all kinds of modifications to a tag's attributes (add, remove, modify) using dictionary operators or methods.
In the following example, the value attribute is updated. The regenerated HTML string shows the change.
Example
from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input
print(tag.attrs)
tag['value'] = 'Ravi'
print(soup)
Output
{'type': 'text', 'name': 'name', 'value': 'Raju'}
<html><body><input name="name" type="text" value="Ravi"/></body></html>
Next, we add a new id attribute and delete the value attribute.
Example
from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input
tag['id'] = 'nm'
del tag['value']
print(soup)
Output
<html><body><input id="nm" name="name" type="text"/></body></html>
Multi-valued attributes
Some HTML attributes can have multiple values. The most commonly used is the class attribute, which can hold several CSS class names. Others include 'rel', 'rev', 'headers', 'accesskey' and 'accept-charset'. Beautiful Soup presents the value of a multi-valued attribute as a list.
Example
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body"></p>', 'lxml')
print("css_soup.p['class']:", css_soup.p['class'])

css_soup = BeautifulSoup('<p class="body bold"></p>', 'lxml')
print("css_soup.p['class']:", css_soup.p['class'])
Output
css_soup.p['class']: ['body']
css_soup.p['class']: ['body', 'bold']
However, if an attribute looks as if it has more than one value but is not defined as multi-valued by any version of the HTML standard, Beautiful Soup leaves the attribute alone and returns its value as a single string:
Example
from bs4 import BeautifulSoup

id_soup = BeautifulSoup('<p id="body bold"></p>', 'lxml')
print("id_soup.p['id']:", id_soup.p['id'])
print("type(id_soup.p['id']):", type(id_soup.p['id']))
Output
id_soup.p['id']: body bold
type(id_soup.p['id']): <class 'str'>
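When code needs a uniform shape regardless of the attribute, two options may help. The sketch below assumes a reasonably recent bs4 release (the multi_valued_attributes constructor argument is documented from version 4.8 onward) and uses hypothetical sample markup: get_attribute_list() always returns a list, while multi_valued_attributes=None keeps every value as a plain string.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with one multi-valued and one ordinary attribute.
markup = '<p class="body bold" id="main"></p>'

# Default behaviour: class is split into a list, id stays a string.
soup = BeautifulSoup(markup, "html.parser")
print(soup.p["class"])                    # ['body', 'bold']
print(soup.p["id"])                       # main
print(soup.p.get_attribute_list("id"))    # ['main']

# Opting out: no attribute is treated as multi-valued.
plain = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
print(plain.p["class"])                   # body bold
```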
NavigableString object
Usually, a string is placed between the opening and closing tags of an element. The browser's HTML engine applies the intended effect to the string while rendering the element. For example, in <b>Hello World</b>, the string sits between the <b> and </b> tags, so it is rendered in bold.
The NavigableString object represents the text contents of a tag. It is an object of the bs4.element.NavigableString class. To access the contents, use ".string" with the tag.
Example
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
print(soup.string)
print(type(soup.string))
Output
Hello, Tutorialspoint!
<class 'bs4.element.NavigableString'>
A NavigableString object is similar to a Python Unicode string, and it also supports some of the features described in Navigating the tree and Searching the tree. A NavigableString can be converted to a plain Unicode string with the str() function.
Example
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
tag = soup.h2
string = str(tag.string)
print(string)
Output
Hello, Tutorialspoint!
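The difference between a NavigableString and the plain string produced by str() is easier to see side by side. This minimal sketch reuses the h2 markup from above; the variable names nav and plain are only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", "html.parser")
nav = soup.h2.string

print(isinstance(nav, str))       # True: NavigableString subclasses str
print(nav.upper())                # HELLO, TUTORIALSPOINT!
print(nav.parent.name)            # h2: the string knows its place in the tree

plain = str(nav)                  # plain copy with no link to the tree
print(type(plain).__name__)       # str
print(hasattr(plain, "parent"))   # False
```

Converting with str() is useful when the text should outlive the parsed document, since the plain copy no longer refers back to the parse tree.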
Like a Python string, which is immutable, a NavigableString cannot be modified in place. However, you can use replace_with() to replace the inner string of a tag with another.
Example
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
tag = soup.h2
tag.string.replace_with("OnLine Tutorials Library")
print(tag.string)
Output
OnLine Tutorials Library
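Assigning to the tag's .string property is another way to get the same result. The sketch below is one possible variant, not the only approach; it also checks that the old string is detached from the tree afterwards.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", "html.parser")
tag = soup.h2

old = tag.string
tag.string = "OnLine Tutorials Library"   # replaces the tag's contents

print(tag)          # <h2 id="message">OnLine Tutorials Library</h2>
print(old)          # Hello, Tutorialspoint!
print(old.parent)   # None: the old string is no longer part of the tree
```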
BeautifulSoup object
The BeautifulSoup object represents the entire parsed document. It is the object created when we parse the markup of a web resource. For most purposes it can be treated like a Tag object, so it supports the functionality required to navigate and search the document tree.
Example
from bs4 import BeautifulSoup

# Open the file in a with block so that it is closed after parsing
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

print(soup)
print(soup.name)
print('type:', type(soup))
Output
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul>
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
[document]
type: <class 'bs4.BeautifulSoup'>
The name property of a BeautifulSoup object always returns [document].
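The relationship to Tag can be checked directly. In this sketch a short inline string stands in for a file such as index.html (an assumption made so that the snippet runs on its own), and the object is searched with find_all() just as a tag would be.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline markup stands in for the contents of a file such as index.html.
soup = BeautifulSoup("<ul><li>Accounts</li><li>HR</li></ul>", "html.parser")

print(isinstance(soup, Tag))   # True: BeautifulSoup is a subclass of Tag
print(soup.name)               # [document]
print(soup.attrs)              # {}: the document itself has no attributes

# Searching works on the whole document exactly as it does on a tag.
print([str(li.string) for li in soup.find_all("li")])   # ['Accounts', 'HR']
```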
Two parsed documents can be combined by passing a BeautifulSoup object as the argument to a tree-modifying method such as replace_with().
Example
from bs4 import BeautifulSoup

obj1 = BeautifulSoup("<book><title>Python</title></book>", features="xml")
obj2 = BeautifulSoup("<b>Beautiful Soup parser</b>", "lxml")

obj2.find('b').replace_with(obj1)
print(obj2)
Output
<html><body><book><title>Python</title></book></body></html>
Comment object
Any text written between <!-- and --> in an HTML or XML document is treated as a comment. Beautiful Soup represents such commented text as a Comment object.
Example
from bs4 import BeautifulSoup

markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(comment, type(comment))
Output
This is a comment text in HTML <class 'bs4.element.Comment'>
The Comment object is a special type of NavigableString object. The prettify() method displays the comment text with special formatting:
Example
print(soup.b.prettify())
Output
<b>
 <!--This is a comment text in HTML-->
</b>
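Because Comment is a subclass of NavigableString, comments can be picked out of a larger document with an isinstance() check. The sketch below uses hypothetical markup with two comments and passes a function as the string argument of find_all(); it also shows get_text() leaving the comments out, which is the behaviour of recent bs4 releases.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical sample markup containing two comments and one visible string.
markup = "<div><!--first note--><p>Visible text<!--second note--></p></div>"
soup = BeautifulSoup(markup, "html.parser")

# Keep only the strings that are Comment objects.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    print(str(comment), isinstance(comment, NavigableString))
# first note True
# second note True

print(soup.get_text())   # Visible text
```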