DOWNLOADER_MIDDLEWARES = {
    'postscrape.middlewares.PostscrapeDownloaderMiddleware': 543,
    'postscrape.middlewares.ProxyMiddleware': 1,
}
Next, head to middlewares.py and create the class you specified in DOWNLOADER_MIDDLEWARES, as shown below:
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "httpproxy.yourproxy.name"
Test your crawler:
scrapy shell https://blog.scrapinghub.com/
Creating a spider file.
We want to create a spider class inside the spiders folder. To do this, create a new file inside spiders and give it a name, for example bbc_spider.py.
Note that the name you use later when you call the spider to crawl is not the file name or the class name, but the value of the name attribute you define inside the class.
start_urls is the list of URLs Scrapy uses to start collecting HTML.
The parse method is where the HTML data is extracted. For example, in the parse method we loop through a specific CSS class and get the header, summary, and top news.
To check the setup, we can run:

scrapy list

This lists all of our spider classes; if there is an issue with a class, it will not show up here.
Finally, to actually run the spider, we can use:
scrapy crawl bbcSpider
Now let's see what the full code looks like. It crawls BBC and saves the output to a location of your choice on your PC via a pandas DataFrame:
settings.py
middlewares.py
bbc_spider.py