Python 3 - Mechanize and BeautifulSoup
[Updated December 2019]
Mechanize and BeautifulSoup are two essential modules for data acquisition.
However, Mechanize is only available on Python 2. There is still a way to get the same functionality on Python 3, and I'll show you one solution.
If you're using Python 3 and try to use the Mechanize module to navigate through web forms, you'll get this error:
Traceback (most recent call last):
  File "/Users/michaelcaraccio/PycharmProjects/WebScraping/test.py", line 3, in <module>
    import mechanize
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/mechanize/__init__.py", line 119, in <module>
    from _version import __version__
ImportError: No module named '_version'
Unfortunately, Mechanize is incompatible with Python 3 (see the issue "Support Python 3 #96").
But there is another way to make it work. You'll see it later in this article.
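The traceback above shows that the failure happens at import time, even though the package is installed. A minimal sketch of how a script could detect this and pick whichever library actually imports on the running interpreter is shown here. The helper name first_importable is hypothetical, and catching SyntaxError is an assumption that covers Python 2 only source files.

```python
import importlib


def first_importable(candidates):
    """Return the name of the first module in `candidates` that imports cleanly.

    Hypothetical helper: it really imports each module, because a package can
    be installed and still fail at import time (as in the traceback above).
    """
    for name in candidates:
        try:
            importlib.import_module(name)
        except (ImportError, SyntaxError):
            # Not installed, or not compatible with this interpreter.
            continue
        return name
    return None


# Prefer mechanize when it works, otherwise fall back to mechanicalsoup.
print(first_importable(["mechanize", "mechanicalsoup"]))
```

The function returns None when nothing in the list can be imported, so the caller can print an installation hint instead of crashing.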
Python 2 - Code example
Before giving you the answer, let's look at a working example that uses BeautifulSoup and Mechanize. The following code shows how to log in to your Twitter account and check whether you're connected:
import mechanize
from bs4 import BeautifulSoup
if __name__ == "__main__":
    URL = "https://twitter.com/login"
    LOGIN = "yourlogin"  # email login
    PASSWORD = "yourpassword"
    TWITTER_NAME = "twittername"  # without @
    # Create a browser object
    browser = mechanize.Browser()
    browser.set_handle_robots(False)  # ignore robots.txt
    browser.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')]
    # Open webpage
    browser.open(URL)
    # Select the form
    browser.select_form(nr=1)
    browser.form['session[username_or_email]'] = LOGIN
    browser.form['session[password]'] = PASSWORD
    response = browser.submit()
    # Get response
    userPage = BeautifulSoup(response, 'html.parser')
    # find() returns None when the link is missing, so guard before reading .string
    link = userPage.find("a", {"class": "u-linkComplex"})
    user = link.string if link is not None else None
    # Check if connected
    if user == TWITTER_NAME:
        print("You're connected as " + user)
    else:
        print("You're not connected")
If you want to try this code, change the following variables:
- LOGIN = "yourlogin" # email login
- PASSWORD = "yourpassword"
- TWITTER_NAME = "twittername" # without @
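Instead of editing these three variables in the source file, the values can also be read from the environment so that the password never ends up in the script. The following is only a sketch: the environment variable names TWITTER_LOGIN, TWITTER_PASSWORD and TWITTER_NAME and the helper load_credentials are assumptions made for this illustration, not something the libraries require.

```python
import os


def load_credentials(environ=None):
    """Return (login, password, twitter_name) read from environment variables.

    The variable names are hypothetical and chosen only for this sketch.
    """
    if environ is None:
        environ = os.environ
    keys = ("TWITTER_LOGIN", "TWITTER_PASSWORD", "TWITTER_NAME")
    missing = [key for key in keys if not environ.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    login, password, name = (environ[key] for key in keys)
    # The example expects the name without the leading @
    return login, password, name.lstrip("@")
```

You would then write LOGIN, PASSWORD, TWITTER_NAME = load_credentials() at the top of the script and export the three variables in your shell before running it.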
Python 3 - Solution
MechanicalSoup
Mechanize seemed to be no longer maintained, so after some research I found this module: MechanicalSoup.
MechanicalSoup provides a Mechanize-like interface built on top of Requests and BeautifulSoup in a single library. When this article was first written, it could be used with Python 2.6 through 3.4.
GitHub: MechanicalSoup.
Installation
With pip:
pip install MechanicalSoup
Or, if you're using PyCharm:
Preferences -> Project Interpreter -> Select your project -> Click on the + button -> Search for MechanicalSoup and install it
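To confirm that the package landed in the interpreter your project actually uses, you can ask Python for the installed version. This is a small sketch that assumes Python 3.8 or newer, where importlib.metadata is part of the standard library. The helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of `distribution`, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("MechanicalSoup")
if version is None:
    print("MechanicalSoup is not installed in this interpreter")
else:
    print("MechanicalSoup " + version + " is installed")
```

If the first message appears even though pip reported success, pip and your script are probably using two different Python installations.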
Python 3 example (Updated December 2019)
Here is my code after porting it to MechanicalSoup:
import mechanicalsoup  # Don't forget to import the new module
if __name__ == "__main__":
    URL = "https://twitter.com/login"
    LOGIN = "your_login"
    PASSWORD = "your_password"
    TWITTER_NAME = "displayed_name"  # Displayed username on Twitter
    # Create a browser object
    browser = mechanicalsoup.StatefulBrowser()
    # Request the Twitter login page
    browser.open(URL)
    # Grab the login form
    browser.select_form('form[action="https://twitter.com/sessions"]')
    # Print the form inputs
    browser.get_current_form().print_summary()
    # Specify username and password
    browser["session[username_or_email]"] = LOGIN
    browser["session[password]"] = PASSWORD
    # Submit the form
    response = browser.submit_selected()
    # Get the current page output
    response_after_login = browser.get_current_page()
    # Verify we are now logged in (look for an img element whose alt text is the username).
    # If you find a better way to check, let me know. Since Twitter generates all of its
    # class names dynamically, it's pretty complicated to get better information.
    # The attribute value is quoted so that names containing spaces still form a valid selector.
    user_element = response_after_login.select_one('img[alt="' + TWITTER_NAME + '"]')
    # If such an img element exists, the user is successfully connected
    if user_element is not None:
        print("You're connected as " + TWITTER_NAME)
    else:
        print("Not connected")
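In this example I use StatefulBrowser() instead of Browser() because it keeps track of the current page and the selected form between calls, which is what select_form(), submit_selected() and get_current_page() rely on. Note that MechanicalSoup does not execute JavaScript, so it only follows regular HTTP redirects after the login.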
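The login check in the example builds an img[alt=...] selector by string concatenation, which becomes fragile when the displayed name contains spaces or quotes. As a sketch, a small helper can quote and escape the value before it is passed to select_one(). The function name attr_selector is hypothetical, and it only handles the two characters that matter inside a double-quoted CSS string.

```python
def attr_selector(tag, attr, value):
    """Build a CSS selector such as img[alt="Jane Doe"] with the value safely quoted."""
    # Escape backslashes first, then double quotes, as CSS strings require.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '{}[{}="{}"]'.format(tag, attr, escaped)


print(attr_selector("img", "alt", "Jane Doe"))
```

With this helper, the check in the example could be written as response_after_login.select_one(attr_selector("img", "alt", TWITTER_NAME)).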
That’s it! Now you can log in to the website you want and start scraping!
Image credit
Green Tree Python by Ian C is licensed under CC BY-SA 2.0