
Python How to scrape quicker

simong1993

King Coder
Staff Team
Guardian
Hey all,

So I'm scraping data from one website, but I have to run proxies: I load a proxy, check it with 100 different headers, and if one lets me in I grab what I need and move on to the next one. Simples. Now for the issue: it's so, so slow, and since I can only use UK proxies my pool has shrunk from 9,000 to 250, which is having a bit of an impact. I thought adding concurrency to my script would help, and it does, but not enough. I'm currently doing 4 pages a second and I've got 100,000 to do :S Don't get me wrong, I've come from 1-2 pages every 2-3 seconds, so it's getting better, but I need some fresh ideas to speed it up :D
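For reference, the concurrent part of my setup is roughly along these lines (heavily simplified; fetch_page here just stands in for the real proxy/header request):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_page(url):
    # Placeholder for the real work: pick a proxy, set headers, GET the page.
    return len(url)

urls = ["https://example.com/item/%d" % i for i in range(10)]

# A fixed-size pool of worker threads fetches many pages at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(fetch_page, u): u for u in urls}
    results = {futures[f]: f.result() for f in as_completed(futures)}
```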
 

Antero360

King Coder
You might want to be careful about making the scrapes as fast as possible. Remember, you want as much delay between requests as you can afford, so you don't trigger an IP ban. Yes, proxies will definitely get around that issue, but you want to use the proxies without getting them flagged. Switching proxies after each request should help, but verifying that a proxy is up and running still takes a few seconds. You'll also want to take into consideration any crawl delays specified in the site's robots.txt file...
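For what it's worth, Python's standard library can read those delays for you. A minimal sketch (the robots.txt content here is made up, not Amazon's actual file):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; in practice you'd call rp.set_url(...) + rp.read()
robots_txt = """
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

delay = rp.crawl_delay("*")   # the delay the site asks crawlers to honour, if any
allowed = rp.can_fetch("*", "https://example.com/products/1")
```

You can then sleep for `delay` seconds between requests to the same host.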

Here's a great resource to guide you on your scraping journey
 

Ghost

King Coder
If you have the money to spend, you could try running your script from more premium IPs or set yourself up with a bunch of VPNs.
If you need that many proxies, can't scrape without them, and don't have the money for option 1, then I'd recommend just loading up as many copies of the script as you can, each starting with a different set of proxies.

How are you saving the data? Is that having an impact on your results? Are you ever scraping any pages more than once? Do you have any other ways to target data that perhaps could be faster? Can you share some of your code?
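Splitting the proxy pool between instances could be as simple as this (sketch; the proxy strings are placeholders):

```python
def split_proxies(proxies, n_instances):
    # Slice the pool so each script instance gets its own disjoint share.
    return [proxies[i::n_instances] for i in range(n_instances)]

pools = split_proxies(["proxy%d" % i for i in range(10)], 3)
# Each instance is then launched with one entry from `pools`.
```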
 

simong1993

King Coder
Staff Team
Guardian
I did find one proxy company that was amazing, but it would cost me around £300 a month plus server costs; it's just not feasible for me at the moment :(

At the moment I use concurrent connections, so I have 500 rotating proxies. The script takes a URL and a proxy, checks it with 500 different headers, makes sure it hasn't hit a captcha check, etc., and if all is good returns the HTML for the rest of the script to dissect.

I save the data via SQL, but only if the data has changed. The script loads up and pulls price, stock and ref from the database, then checks the data it scrapes against what I have stored in an array. If it's changed it uploads; if it hasn't, it moves on. This is the quickest way I have found to do it :D
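In rough outline, that compare step is something like this (made-up data; the `updates` list stands in for the real SQL call):

```python
# Snapshot pulled from the database once at startup: ref -> (price, stock)
db_cache = {
    "B000TEST01": ("9.99", 5),
    "B000TEST02": ("4.50", 0),
}

updates = []  # rows that actually need an SQL UPDATE

def check_item(ref, price, stock):
    # Queue an update only if the scraped values differ from the cached ones.
    if db_cache.get(ref) != (price, stock):
        updates.append((ref, price, stock))
        db_cache[ref] = (price, stock)  # keep the cache current

check_item("B000TEST01", "9.99", 5)  # unchanged -> no update
check_item("B000TEST02", "4.75", 2)  # changed   -> queued
```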

Each page is scraped once a day :)

No other ways; I have covered every way possible :D

Python:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# NUM_RETRIES, proxies4, AmazonBotCheck and PageChecking are defined
# elsewhere in the full script.

def scrape_url(url):
    headers = {}
    good_to_go = False
    soup = None

    for _ in range(NUM_RETRIES):
        try:
            # Pick a fresh Chrome user agent for each attempt
            ua = UserAgent(use_cache_server=False, verify_ssl=False)
            headers["User-Agent"] = ua.chrome
            urlcheck = "https://www.amazon.co.uk/dp/" + url[1]

            response = requests.get(urlcheck, proxies=proxies4, headers=headers)

            if response.status_code in (200, 404):
                soup = BeautifulSoup(response.text, 'html.parser')
                captcha_check = AmazonBotCheck(soup, urlcheck, url[1])
                if captcha_check[0] is False:
                    soup = BeautifulSoup(captcha_check[1], features="lxml")
                    price_tag = soup.find('span', class_='a-color-price')
                    if price_tag is not None:
                        price = price_tag.text.strip()
                        good_to_go = True
                        break  # successful response, escape the retry loop
        except requests.RequestException:
            continue  # bad proxy/header combo, try the next one

    # Parse the data only if one of the attempts succeeded
    if good_to_go:
        ASIN = url[1]
        DBPrice = url[2]
        PageChecking(soup, ASIN, urlcheck, DBPrice)
here is where the main action happens :D
 

Ghost

King Coder
I recommend saving to a format other than SQL to begin with.
Saving each result (if it's different) forces you to check SQL and then insert/update SQL, which can add multiple seconds per result.

Personally I save my Python scraping results to CSV and then import to SQL later so that the time to scrape is not affected by my time to import.
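The CSV side is only a few lines with the standard library (sketch; `io.StringIO` stands in for a real `open("results.csv", "a", newline="")` here so it runs anywhere):

```python
import csv
import io

rows = [
    ("B000TEST01", "9.99", 5),
    ("B000TEST02", "4.75", 2),
]

buf = io.StringIO()  # swap for a real file handle in the scraper
writer = csv.writer(buf)
writer.writerow(("asin", "price", "stock"))  # header row
for row in rows:
    writer.writerow(row)

csv_text = buf.getvalue()
```

Most databases can then bulk-load the file in one go (e.g. MySQL's LOAD DATA INFILE), which is far faster than row-by-row inserts.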
 

simong1993

King Coder
Staff Team
Guardian
Are you sure, Ghost? The issue with Excel or Open Office is corruption. I'll admit my Python has come a long way since I tried that, but all I kept doing was corrupting Excel :S

The script is a bit more improved now, though. I keep an array of the price/stock, compare my array to what was scraped, and if it's different it uploads; if not, move on. I have found doing it this way has roughly halved the server resources for the database, so it's working.

The bottleneck I'm facing is the proxy/header combo. It's taking so long to get the data; processing it takes under a second, but finding a proxy/header combo that works is what takes the time.
 

Ghost

King Coder
Well, it depends on how you are saving to SQL.
It's one thing to dump a lot of data into an SQL file or construct a new one. However, if you are inserting rows into a database one at a time (or even in batches), it can take longer than saving to a local file. The main difference is that you will need a way to read the file and import it to the database later, or import it directly if the saved file is in the proper format.

You might end up having the entire process take more time, but the crawling itself will be incredibly faster.



Are you forced to create a new header every time in the loop? Is it possible to construct headers ahead of time and save them somehow, so that you don't have to do that on each iteration?
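Something like this is what I mean, built once at startup instead of calling UserAgent() inside the loop (the user-agent strings are just examples):

```python
import itertools

# Built once at startup; reused by every request.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) Chrome/120.0",
    "Mozilla/5.0 (X11; Linux x86_64) Chrome/120.0",
]

HEADER_POOL = [{"User-Agent": ua, "Accept-Language": "en-GB,en;q=0.9"}
               for ua in USER_AGENTS]

header_cycle = itertools.cycle(HEADER_POOL)  # round-robin through the pool

first = next(header_cycle)
second = next(header_cycle)
```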

As for the proxies, I think your best bet is to use threads to have multiple going at once. If you want to get fancy, you could monitor system usage and automatically add new crawler threads as long as RAM and CPU stay below whatever threshold you set. However you do it, I recommend running your script many more times over. You could even just run multiple instances of Python if you want to keep the code as-is and not add multithreading.
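Even a crude version of that helps. A sketch that sizes the pool from the machine instead of hard-coding a number (real RAM/CPU monitoring would need something like psutil; this only uses the core count):

```python
import os

def pick_worker_count(per_core=30, hard_cap=120):
    # Scale threads with the CPU count, but never past a hard ceiling.
    cores = os.cpu_count() or 1
    return min(cores * per_core, hard_cap)

workers = pick_worker_count()
```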

If you do that, I recommend having a way to stop the processes easily. You need to make sure you don't let scripts go rogue and keep running when you need them to stop, for whatever reason that may be.
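A simple kill switch for threaded crawlers is a shared Event that every worker checks between pages (sketch; the append stands in for one scrape iteration):

```python
import threading
import time

stop_flag = threading.Event()

def worker(results):
    # Keep scraping until someone asks us to stop.
    while not stop_flag.is_set():
        results.append("scraped")  # stands in for one scrape iteration
        time.sleep(0.01)

results = []
t = threading.Thread(target=worker, args=(results,))
t.start()
time.sleep(0.05)
stop_flag.set()  # ask the worker to finish its current page and exit
t.join()
```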
 

simong1993

King Coder
Staff Team
Guardian
With SQL, I take the whole database of what I need, scrape, and compare against the data I pulled from the database into an array. My script takes around a day to do 100,000, so it only connects to pull data once a day. If the price or stock is different, it connects and updates the database with what it has just found, if that makes sense :)

I have tried feeding in only working proxies/headers before, and it failed. The setup is: I have 3 variations of the scraper running. One looks at in-stock items, one looks at out-of-stock items, and one gets just the price of the in-stock items. It's the best way I have found to do it.

Scraper one has 50 concurrent threads doing that one def I showed above, scraper two has 20 and scraper three has 50, so that's 120 threads at once, and that's about the max for my server (4 CPU cores, 160 GB storage, 8 GB RAM). It could handle double that, but I like it idling at half, and I'm limited to 20 connections at a time with my proxy, so at this thread count I'm pushing it lol.

When I made a separate script to find good header/proxy combos, within seconds of finding one it had been used and blocked, but then again that was before I knew about threading. If I know which proxies and headers work, I won't need so many concurrent connections... hhhhmmmm Ghost, you may be onto something here
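Sketch of what I'm now thinking: keep a small pool of combos that recently worked and try those first before burning a fresh one (all the names and addresses here are made up):

```python
import collections
import random

# Known-good (proxy, user-agent) pairs, most recently successful first.
good_combos = collections.deque(maxlen=50)

def record_success(proxy, ua):
    good_combos.appendleft((proxy, ua))

def pick_combo(all_proxies, all_uas):
    # Prefer a combo that worked recently; sometimes fall back to a fresh
    # random pair so the good ones don't get hammered and blocked.
    if good_combos and random.random() < 0.8:
        return good_combos[0]
    return random.choice(all_proxies), random.choice(all_uas)

record_success("10.0.0.1:8080", "ua-1")
combo = pick_combo(["10.0.0.2:8080"], ["ua-2"])
```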
 