Creating a Website Scraper Using geckodriver (for Firefox) and Selenium (on Pop!_OS Linux)

First, install Python and pip:

sudo apt-get update
sudo apt-get install python3 python3-pip

Now let's set up a virtual environment. Start by installing virtualenv:

pip3 install virtualenv

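You may get a warning that the directory pip installed the virtualenv script into (typically /home/<your-user>/.local/bin) is not on your PATH. If so, append that directory to the PATH line in /etc/environment and reload the file. Note that source is a shell builtin, so it is run without sudo:

sudo nano /etc/environment
source /etc/environment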
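
To see whether the warning applies at all, a quick check from Python can help. The sketch below is not part of the original setup: it assumes pip placed the virtualenv script in ~/.local/bin (the usual per-user location, but an assumption here), and the helper name dir_on_path is made up for illustration.

```python
import os
import shutil
from pathlib import Path


def dir_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = Path(directory).expanduser().resolve()
    entries = [p for p in path_value.split(os.pathsep) if p]
    return any(Path(p).expanduser().resolve() == target for p in entries)


if __name__ == "__main__":
    # Assumption: pip's per-user script directory is ~/.local/bin
    user_bin = Path.home() / ".local" / "bin"
    print("on PATH:", dir_on_path(user_bin))
    # shutil.which looks a command name up on PATH, much as the shell does
    print("virtualenv found at:", shutil.which("virtualenv"))
```

If the first line prints False, the directory still needs to be added; if the second prints None, the shell will not find virtualenv by name either.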

Now let's create the project directory, set up the virtual environment inside it, and install Selenium:

mkdir -pv selenium-firefox/drivers
cd selenium-firefox
virtualenv .venv
source .venv/bin/activate
pip3 install selenium pandas

Now download and extract the latest geckodriver release from https://github.com/mozilla/geckodriver/releases/ (the commands below use v0.29.1; substitute the current version number):

wget https://github.com/mozilla/geckodriver/releases/download/v0.29.1/geckodriver-v0.29.1-linux64.tar.gz
tar -xzf geckodriver-v0.29.1-linux64.tar.gz -C drivers/
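
Before writing the scraper, it can be worth confirming that the driver landed where the script expects it. The following is a minimal sketch under the assumption that you are in the project directory and the archive was extracted to ./drivers/geckodriver as above; driver_ready is a hypothetical helper name.

```python
import os
import subprocess
from pathlib import Path


def driver_ready(path):
    """Return True if `path` is an existing file the current user may execute."""
    p = Path(path)
    return p.is_file() and os.access(p, os.X_OK)


if __name__ == "__main__":
    driver_path = "./drivers/geckodriver"  # path used in this tutorial
    if driver_ready(driver_path):
        # geckodriver prints its version and exits when given --version
        result = subprocess.run(
            [driver_path, "--version"], capture_output=True, text=True
        )
        print(result.stdout.splitlines()[0] if result.stdout else "no output")
    else:
        print("geckodriver not found or not executable at", driver_path)
```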

Now let's create a sample script (a simple downloader):

import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

firefoxOptions = Options()

# Uncomment to run Firefox without a visible window
# firefoxOptions.add_argument("-headless")

# Selenium 4 takes the driver path through a Service object
service = Service(executable_path="./drivers/geckodriver")
driver = webdriver.Firefox(service=service, options=firefoxOptions)

# Navigate to the login page
driver.get("https://some-page/my-account/")

time.sleep(5)

# Log in
username = driver.find_element(By.ID, "username")
username.clear()
username.send_keys("usernamehere")

password = driver.find_element(By.ID, "password")
password.clear()
password.send_keys("passwordhere")

persistLogin = driver.find_element(By.ID, "rememberme")
persistLogin.click()

time.sleep(5)

driver.find_element(By.NAME, "login").click()

time.sleep(5)

# Head to the assets page
driver.get("downloadurlhere")

while True:
    # Submit every download form on the current page
    downloadList = driver.find_elements(By.ID, "download-single-form")
    for download in downloadList:
        download.submit()
        time.sleep(25)  # give each download time to finish

    # Go to the next page; stop when there is no "next" button left
    try:
        driver.find_element(By.CSS_SELECTOR, ".next[value='next']").click()
    except NoSuchElementException:
        break
    time.sleep(5)

driver.quit()

Save as browser.py and run as follows:

python3 browser.py
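
A side note on the fixed time.sleep(5) pauses in the script: they wait the full five seconds even when the page is ready sooner, and they fail when the page takes longer. One alternative is to poll for a condition with a timeout. The sketch below shows that idea in plain Python; wait_until is a hypothetical helper written for illustration and is not part of Selenium (Selenium ships its own WebDriverWait class for this purpose).

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

In the script, the condition could be a lambda that asks the driver for the username field with a find-elements lookup: that returns an empty list (falsy) until the field exists, so the wait ends as soon as the page is ready.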

Note that pandas was installed alongside Selenium so that any data you scrape can be cleaned, filtered, or exported (for example to CSV) if necessary.
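
As a rough illustration of that kind of manipulation, here is a small sketch. The rows are invented sample data, not output of the script above; in practice they would be collected from the pages the scraper visits.

```python
import pandas as pd

# Hypothetical rows, as a scraper might collect them
rows = [
    {"name": "report.pdf", "size_kb": 120},
    {"name": "logo.png", "size_kb": 45},
    {"name": "report.pdf", "size_kb": 120},  # duplicate entry
]

df = pd.DataFrame(rows)

# Drop exact duplicates and order by size, smallest first
df = df.drop_duplicates().sort_values("size_kb").reset_index(drop=True)

# Produce CSV text (pass a filename to to_csv to write a file instead)
csv_text = df.to_csv(index=False)
print(csv_text)
```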

2 Comments

susan Thakuri

sir i am unable to understand this line “You may get a warning about the path, add the path as follows:” and i am completely new to linux

June 26, 2021
- phoenix17
  
  The PATH is a list of directories that the shell searches when you type a command name. In this tutorial it is set in the file /etc/environment, which can be reloaded with the source command. A program in any directory on the PATH can be run by name from anywhere, as if it were in the current directory.
  
  For example, if a script is located at /var/temp1/t.sh and I run t.sh in a terminal from outside that directory, the shell won’t be able to find the file. However, if I add /var/temp1 to the PATH, then t.sh will run correctly, provided it is executable.
  
  August 22, 2021
