Web scraping: infinite scrolling with selenium

in python3

zihan
4 min read · May 22, 2021

When diving into web scraping in python, the choice usually comes down to two libraries: beautifulsoup and selenium.
While beautifulsoup is quite practical for breaking down html pages, selenium, which drives a real browser (optionally in headless mode), can do this and much more: it can type search queries into google, scrape content rendered by js, loop through pages one by one, etc.

Moreover, it comes in handy when web pages load content only upon scrolling.
Loading content upon scrolling speeds up the page loading time and offers a better user experience. But it also gives web scrapers plenty of headaches.
Here’s a solution.

This article is structured as follows:
1 — IMPORTING libraries
2 — SELENIUM setup
3 — fix INFINITE SCROLLING
4 — FREQUENTLY ENCOUNTERED PROBLEMS and fixes
5 — EXTRA: trigger js from within python

Image by Ryan McGuire from Pixabay

1 — IMPORTING libraries

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

2 — SELENIUM setup

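def get_selenium():
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    options.add_argument('--headless')  # comment out to watch the browser while debugging
    # recent selenium versions take options=; the older chrome_options= keyword is deprecated
    driver = webdriver.Chrome(options=options)
    return driver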

Here you can choose which browser to use.
Chrome offers more options than Firefox, so we will go with it.
The two flags ignore-certificate-errors and incognito can be omitted, while the headless argument is quite important.
When selenium runs headless, Chrome does not open a visible window.
On the other hand, if for any reason you run into a problem while scraping, comment out the headless option: you will then be able to watch what Chrome is doing and what is being loaded on the page.
Sometimes you will stumble upon a cookie banner or a captcha that prevents your page from loading; you can then click ok and proceed to the page normally.
If the browser closes unexpectedly, use time.sleep() to pause the code and give yourself time to debug.
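
Rather than commenting the headless line in and out, the flag can be turned into a parameter. The following is only a sketch and not part of the original setup: chrome_arguments and the headless parameter are hypothetical names, and it assumes a recent selenium version that accepts options= in webdriver.Chrome().

```python
def chrome_arguments(headless=True):
    """Return the Chrome command-line flags used by the scraper."""
    args = ["--ignore-certificate-errors", "--incognito"]
    if headless:
        # Leave this flag out while debugging to watch the browser window.
        args.append("--headless")
    return args


def get_selenium(headless=True):
    """Build a Chrome driver; needs selenium and a local Chrome install."""
    # Imported here so that chrome_arguments() has no dependencies of its own.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

With this, get_selenium(headless=False) opens a visible Chrome window for debugging, while the default call stays headless.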

3 — fix INFINITE SCROLLING

Premise: to fix infinite scrolling, you will need to look into your page's html structure. Html pages are all different, but the general idea is the same: find the last element loaded on the page, use selenium to scroll down to that element, use time.sleep() to wait for the page to load more content, scroll again to the new last element, and repeat until you reach the end of the page.

Here is an example.

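selenium = get_selenium()
selenium.get("your/url")
current_last_elem = "#my-div > ul > li:last-child"  # css selector of the last loaded element
scroll = "document.querySelector(\'" + current_last_elem + "\').scrollIntoView();"
find = "return document.querySelector(\'" + current_last_elem + "\');"
last_elem = None
while True:
    selenium.execute_script(scroll)  # execute the js scroll
    time.sleep(3)  # wait for page to load new content
    current_elem = selenium.execute_script(find)  # the element that is now last
    if current_elem == last_elem:
        break  # nothing new was loaded: end of the page
    last_elem = current_elem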

Here we are running plain javascript (no jquery needed) from inside python.
Which is pretty cool.

. selenium.get() opens your url page.
If you need to add a keyword to your url search, you can do selenium.get("your/url.com/{0}".format(keyword)) .

. We initialise last_elem to an empty value, so the first comparison never matches.

. We get the current_last_elem with the help of a CSS selector or Xpath. Since the script calls document.querySelector(), a CSS selector is what we need here. To get the path, open your page, open the browser developer tools (usually by pressing F12, or google how), select the element you need in the page html structure and then right-click > Copy > CSS selector. Here’s a tutorial.

. We use javascript's scrollIntoView() to scroll the page down to the selected element. Pay attention to all the single and double quotes you need here for the format to be correct, and to the escape characters too, i.e. "document.querySelector(\'" + .. + "\').scrollIntoView();"

. We run the js script with selenium.execute_script()

. time.sleep() is important.
If you don’t give the page enough time to load, the new last element will not be there yet, and the loop will stop scrolling too early.

. Every time we scroll, we check if a new last element is found.
If yes, then we have not reached the end of the page yet, and we can keep scrolling.
If not, the page has finished scrolling down and we can break out of the loop.
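
The steps above can also be wrapped in a reusable function. This is only a sketch under assumptions: scroll_to_end, pause and max_rounds are hypothetical names that do not appear in the original code, and the selector is handed to js through selenium's arguments[0] instead of being glued together with quotes.

```python
import time

FIND_LAST = "return document.querySelector(arguments[0]);"
SCROLL_TO = "arguments[0].scrollIntoView();"


def scroll_to_end(driver, selector, pause=3.0, max_rounds=50):
    """Scroll to the last matching element until no newer one shows up.

    `driver` is any object exposing selenium's execute_script(script, *args).
    Returns the number of scrolls performed.
    """
    last_elem = driver.execute_script(FIND_LAST, selector)
    scrolls = 0
    # max_rounds is a safety cap so a truly endless feed cannot loop forever.
    while last_elem is not None and scrolls < max_rounds:
        driver.execute_script(SCROLL_TO, last_elem)  # scroll to the last element
        scrolls += 1
        time.sleep(pause)  # give the page time to load more content
        current_elem = driver.execute_script(FIND_LAST, selector)
        if current_elem == last_elem:
            break  # same last element as before: end of the page
        last_elem = current_elem
    return scrolls
```

A call would look like scroll_to_end(selenium, "#my-div > ul > li:last-child"). The cap on the number of rounds is a design choice of this sketch: some feeds never end, and without it the loop would never return.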

4 — FREQUENTLY ENCOUNTERED PROBLEMS and fixes

Finding the right CSS selector or xpath to the last element will take some time.
Be patient and double or triple check your single and double quotes in the js script.
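
One way to take the quotes out of the equation is to let python write the js string literal for you. The sketch below is an assumption-laden alternative, not the method used earlier in this article: build_scroll_script is a hypothetical helper, and it relies on json.dumps() producing a double-quoted, escaped string that javascript also accepts as a string literal.

```python
import json


def build_scroll_script(selector):
    """Return js that scrolls to the element matched by `selector`."""
    # json.dumps() adds the surrounding double quotes and escapes any
    # quotes or backslashes found inside the selector.
    return "document.querySelector(" + json.dumps(selector) + ").scrollIntoView();"


script = build_scroll_script("#my-div > ul > li:last-child")
```

The resulting script can then be passed to selenium.execute_script() as before.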

If you are sure the path is correct but you still get undefined or always the same last element, try increasing the time.sleep() delay. The page might not have had enough time to load completely.
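
Instead of guessing a longer fixed delay, you can also poll the page until new items show up. This is a hedged sketch of that idea, not something from the original code: wait_for_new_items, timeout and poll are hypothetical names, and the selector is assumed to match every item in the list (without :last-child). Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui for this kind of polling.

```python
import time

COUNT_ITEMS = "return document.querySelectorAll(arguments[0]).length;"


def wait_for_new_items(driver, selector, previous_count, timeout=10.0, poll=0.5):
    """Poll until more than `previous_count` elements match `selector`.

    Returns the new count, or `previous_count` if nothing new appears
    before `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        count = driver.execute_script(COUNT_ITEMS, selector)
        if count > previous_count:
            return count  # new content has arrived
        if time.monotonic() >= deadline:
            return previous_count  # gave up: probably the end of the page
        time.sleep(poll)
```

The function returns as soon as the content arrives, so a generous timeout only costs time on the very last scroll.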

If everything seems correct but it still does not work, comment out the headless option in get_selenium() to see what is going on under the hood.

5 — EXTRA: trigger js from within python

We have already seen that it is possible to trigger a js script from within python.
Here we are going to take this a step further and get a list.

Let’s say we want to get the sources of all the images on the page. Here is what we could do.

js_script = '''\
var jslist = [];
document.querySelectorAll('img').forEach(i => jslist.push(i.src));
return jslist;
'''
python_list = selenium.execute_script(js_script)

With the help of js, we:
. create an empty array called jslist
. select all the img tags in the page
. use forEach to push each img.src into our array
. return the array
. ! pay attention to your '''

We could do the same, for example, for the link hrefs, by selecting all the ‘a’ tags and pushing every a.href into our array. And so on and so forth.
We then run the script with selenium.execute_script() and the value returned by js is stored in the python variable python_list.
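
To avoid writing one script per tag, the same idea can be generalised. The helper below is a sketch with hypothetical names (collect_attribute, tag, attribute); it assumes the tag and the property name are passed to js through selenium's arguments array.

```python
def collect_attribute(driver, tag, attribute):
    """Return the given js property for every `tag` element on the page."""
    js_script = """
    var selector = arguments[0], attr = arguments[1];
    var jslist = [];
    document.querySelectorAll(selector).forEach(el => jslist.push(el[attr]));
    return jslist;
    """
    # Extra positional arguments show up in js as arguments[0], arguments[1], ...
    return driver.execute_script(js_script, tag, attribute)
```

For instance, collect_attribute(selenium, 'a', 'href') would gather the links and collect_attribute(selenium, 'img', 'src') the image sources.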
And our scraping is done.
