Python Using selenium and chromedriver to grab url

I am trying to grab one URL from the log files in headless Chrome. The problem is that sometimes I get just the URL, and other times I get additional characters before or after it, so the URL doesn't work. I don't know why it works sometimes and not others. Is the browser_log variable getting data added to it while my regex is parsing it?
Code:
        import re, json
        from selenium import webdriver
        from selenium.webdriver.chrome.service import Service
        from selenium.webdriver.chrome.options import Options
        
        options = Options()
        options.add_argument("--headless")
        options.set_capability("goog:loggingPrefs", {'performance': 'ALL'})
        service = Service(executable_path="/home/alarm/project_pychrome/chromedriver")
        driver = webdriver.Chrome(service=service, options=options)
        driver.get(url)

        browser_log = driver.get_log('performance')
        regex = '(?=gin\",\"url\":\")*?https:\/\/.*?m3u8?.*?(?=\"},\"requestId)'
        url_hls = re.findall(regex, str(browser_log), re.DOTALL)
        link = url_hls[0]
 
Hi there,
Hope you don't mind me asking... what is the goal of your application?
 
The issue you're facing may be related to the variability in log formats or the asynchronous nature of logging in the browser. Instead of relying on a regular expression, you might want to consider parsing the log entries as JSON and then extracting the URL.

Here's a modified version of your code:

Python:
import json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
options.set_capability("goog:loggingPrefs", {'performance': 'ALL'})
service = Service(executable_path="/home/alarm/project_pychrome/chromedriver")
driver = webdriver.Chrome(service=service, options=options)
driver.get(url)

browser_log = driver.get_log('performance')

# Iterate over log entries
for entry in browser_log:
    log_message = json.loads(entry['message'])['message']
    
    # Check if the log entry is a network response
    if 'Network.response' in log_message['method']:
        url = log_message['params']['response']['url']
        
        # Check if the URL contains 'm3u8' (modify this condition based on your needs)
        if 'm3u8' in url:
            link = url
            break  # Stop iterating if a matching URL is found

# Now 'link' contains the desired URL



This code iterates over the performance log entries and extracts the URL from the entries related to network responses. This approach is more robust than using a regular expression and should help you avoid issues with additional characters in the URL. Adjust the condition for checking the URL as needed based on your specific requirements.
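As a rough sketch of the same idea in reusable form, the loop can be wrapped in a function that matches the method name exactly and tolerates missing keys. The function name `find_response_url` and its `needle` parameter are made up for illustration; the only assumption about the data is that each entry's 'message' field is a JSON string with a nested 'message' object, as in the code above. Matching `Network.responseReceived` exactly skips events such as `Network.responseReceivedExtraInfo`, whose params carry no response object.

```python
import json


def find_response_url(entries, needle="m3u8"):
    """Return the first response URL containing `needle`, or None if none is found."""
    for entry in entries:
        message = json.loads(entry["message"])["message"]
        # Exact match: other "Network.response..." events have different params.
        if message.get("method") != "Network.responseReceived":
            continue
        # Chained .get() calls with defaults avoid a KeyError on unusual entries.
        url = message.get("params", {}).get("response", {}).get("url", "")
        if needle in url:
            return url
    return None
```

It would be called as `link = find_response_url(driver.get_log('performance'))`, and a result of `None` means no matching response was logged.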

 
Thanks for the help. The problem with my original code was that the variable was all on one line, so if my regex matched two URLs it returned both plus everything in between. I figured JSON was what I needed, but I didn't know how to use it until your post.
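The overmatch described here can be reproduced with a small sketch. The `sample` string is a made-up, shortened stand-in for `str(browser_log)`, and both patterns are simplified for illustration rather than taken from the original code. A lazy `.*?` still begins at the first `https://` it finds, so it can swallow an unrelated URL that comes earlier on the same line.

```python
import re

# Hypothetical, shortened stand-in for str(browser_log): two requests on one line.
sample = (
    '{"request":{"url":"https://example.com/page.js"},"requestId":"1"}'
    '{"request":{"url":"https://example.com/live/index.m3u8"},"requestId":"2"}'
)

# Lazy ".*?" still starts at the FIRST "https://" and runs until the first "m3u8".
loose_matches = re.findall(r'https://.*?m3u8', sample, re.DOTALL)

# Forbidding quotes, backslashes and whitespace keeps a match inside one URL.
tight_matches = re.findall(r'https://[^"\\\s]*?\.m3u8[^"\\\s]*', sample)

print(loose_matches)  # one long string spanning both URLs
print(tight_matches)  # only the .m3u8 URL
```

Restricting the character class is only a patch for the regex route; parsing each entry as JSON avoids the problem altogether.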

I ended up with a KeyError, since some entries matched but didn't have the corresponding key, so the code would stop running. I dumped the JSON to a file to see its structure and changed the code to the following. The file had request instead of response. I didn't need to check for m3u, since this hierarchical key path only held the proper URL; likewise I dropped the if statement that checked for a network response, since I just needed to grab the value from the proper key. I tried using log_message.get(), but this didn't work, so I ended up using try/except to get past the KeyError.

Code:
        # Iterate over log entries
        for entry in browser_log:
            log_message = json.loads(entry['message'])['message']
            try:
                link = log_message['params']['request']['url']
            except (TypeError, KeyError):
                continue
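One possible reason a single `log_message.get()` call does not help is that only the outermost lookup is protected: if 'params' or 'request' is missing, the next lookup runs on `None` or on a value that is not a dict. A minimal sketch of a step-by-step lookup follows; the helper name `request_url` is made up for illustration, and the key path is the one used in the code above.

```python
def request_url(log_message):
    """Walk params -> request -> url, returning None if any step is missing."""
    node = log_message
    for key in ("params", "request", "url"):
        # Stop as soon as the current level is not a dict (missing or wrong type).
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node if isinstance(node, str) else None
```

With this helper the loop body could test `url = request_url(log_message)` and skip the entry when the result is `None`, which does the same job as the try/except.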
 
Correction: I had to make sure the URL had m3u in it.

Code:
for entry in browser_log:
    log_message = json.loads(entry['message'])['message']
    try:
        url = log_message['params']['request']['url']
    except (TypeError, KeyError):
        # Entry has no request URL, so skip it
        continue
    if 'm3u' in url:
        link = url
        break  # Stop at the first matching URL
 
