Python crawler crash course: quickly learn to write a simple crawler

The main goal of this article: write the simplest possible crawler in the shortest possible time, one that can grab a forum's post titles and post content.

Audience of this article: beginners who have never written a crawler.

Getting Started

0. Preparation

Things to prepare: Python, scrapy, and an IDE or any text editor.

1. The tech department has studied the matter and decided that you will write the crawler.

Create a working directory anywhere you like, then use the command line to create a project. The project here is named miao; you can replace it with whatever name you prefer.

scrapy startproject miao

You will then get the following directory structure created by scrapy

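A typical layout looks roughly like this (exact files can vary slightly between scrapy versions):

miao/
    scrapy.cfg            # deployment configuration
    miao/                 # the project's python module
        __init__.py
        items.py          # item definitions (used later)
        pipelines.py      # pipelines (used later)
        settings.py       # project settings
        spiders/          # put crawler scripts such as miao.py here
            __init__.py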

Create a python file in the spiders folder, for example miao.py, to serve as the crawler script.

The content is as follows:

import scrapy

class NgaSpider(scrapy.Spider):
    name = "NgaSpider"
    host = "http://bbs.ngacn.cc/"
    # start_urls is the initial set of pages we are going to crawl
    start_urls = [
        "http://bbs.ngacn.cc/thread.php?fid=406",
    ]

    # This is the parse function. Unless otherwise specified, pages fetched
    # by scrapy are handled by this function. All page processing and
    # analysis happens here; in this example we simply print the page content.
    def parse(self, response):
        print(response.body)

2. Give it a run?

Run it from the command line:

cd miao
scrapy crawl NgaSpider

You can see that the crawler has already printed out the first page of the forum's StarCraft section. Of course, since there is no processing yet, the html tags and js scripts are printed out together.

Parsing

Next we need to analyze the page we just grabbed and extract the post titles from this pile of html and js.

In fact, there are many ways to parse a page; only xpath is introduced here.

0. Why not try the magical xpath?

Take a look at what you just scraped, or open the page manually in Chrome and press F12 to inspect the page structure.

Each title is actually wrapped in an html tag like this one, for example:

<a href='...' class='topic'>[Co-op mode] Ideas for modifying co-op mode</a>

As you can see, the href is the address of the post (the forum's base address has to be prepended, of course), and the text inside the tag is the title of the post.

So we use xpath to locate and extract all the elements with class='topic'.

1. See xpath in action

Add an import at the top:

from scrapy import Selector

Change the parse function to:

def parse(self, response):
    selector = Selector(response)
    # xpath extracts all tags with class='topic'; the result is a list,
    # and each element of the list is one of the html tags we are looking for
    content_list = selector.xpath("//*[@class='topic']")
    # Iterate through the list and process each tag
    for content in content_list:
        # Parse the tag and extract the post title we need
        topic = content.xpath('string(.)').extract_first()
        print(topic)
        # Extract the url of the post
        url = self.host + content.xpath('@href').extract_first()
        print(url)

Run it again and you will see the title and url of every post on the first page of the forum section.

Recursion

Next we want to crawl the content of each post.

Here we need python's yield:

yield Request(url=url, callback=self.parse_topic)

This tells scrapy to crawl that url and then parse the fetched page with the specified parse_topic function.

At this point we need to define a new function to analyze the contents of a post.

The complete code is as follows:

import scrapy
from scrapy import Selector
from scrapy import Request

class NgaSpider(scrapy.Spider):
    name = "NgaSpider"
    host = "http://bbs.ngacn.cc/"
    # Only one page is used as the starting url in this example.
    # Of course, reading the starting urls from a database, a file or anywhere else is fine too.
    start_urls = [
        "http://bbs.ngacn.cc/thread.php?fid=406",
    ]

    # The entry point of the crawler. Initialization work can be done here,
    # such as reading the starting urls from a file or a database.
    def start_requests(self):
        for url in self.start_urls:
            # Add the starting url to scrapy's crawl queue and specify the parse function.
            # Scrapy schedules the request itself, fetches the url and hands the content back.
            yield Request(url=url, callback=self.parse_page)

    # Section parse function: extracts the title and address of each post in the section
    def parse_page(self, response):
        selector = Selector(response)
        content_list = selector.xpath("//*[@class='topic']")
        for content in content_list:
            topic = content.xpath('string(.)').extract_first()
            print(topic)
            url = self.host + content.xpath('@href').extract_first()
            print(url)
            # Add the parsed post address to the crawl queue and specify its parse function
            yield Request(url=url, callback=self.parse_topic)
        # Page-flipping information could be parsed here to crawl multiple pages of the section

    # Post parse function: extracts the content of every floor (reply) of a post
    def parse_topic(self, response):
        selector = Selector(response)
        content_list = selector.xpath("//*[@class='postcontent ubbcode']")
        for content in content_list:
            content = content.xpath('string(.)').extract_first()
            print(content)
        # Page-flipping information could be parsed here to crawl multiple pages of the post

At this point, the crawler can crawl the titles of all posts on the first page of the section, and crawl the content of every floor on the first page of each post.

The principle of crawling multiple pages is the same: pay attention to the url of the page-flipping links, set a termination condition, and specify the corresponding parse function; a sketch follows.
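For example, parse_page could queue the next section page roughly like this. This is only a sketch: the page query parameter and the limit of 10 pages are assumptions, not taken from the real forum, so inspect the actual page-flipping links before relying on it.

    # Hypothetical pagination sketch (not from the original article)
    def parse_page(self, response):
        # ... parse titles and yield post Requests exactly as shown above ...
        current_page = response.meta.get("page", 1)
        if current_page < 10:  # assumed termination condition
            # Assumed next-page url; the forum's real paging parameter may differ
            next_url = "http://bbs.ngacn.cc/thread.php?fid=406&page=%d" % (current_page + 1)
            yield Request(url=next_url, callback=self.parse_page,
                          meta={"page": current_page + 1})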

Pipelines

This section is about processing the captured and parsed content, which can be written to local files or databases through pipelines.

0. Define an Item

Create an items.py file in the miao folder.

from scrapy import Item, Field

class TopicItem(Item):
    url = Field()
    title = Field()
    author = Field()

class ContentItem(Item):
    url = Field()
    content = Field()
    author = Field()

Here we define two simple classes to describe the results we crawled.

1. Write a processing method

Find the pipelines.py file under the miao folder; scrapy should have generated it automatically earlier.

We can write a handler here, for example a FilePipeline that writes each item to a local file, as sketched below.
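A minimal sketch of such a FilePipeline (assuming it simply appends each item as a line of text to a hypothetical output.txt; adjust the format and destination to your needs), with the class name matching the one registered in settings.py later:

class FilePipeline(object):
    def process_item(self, item, spider):
        # Called for every item the spider yields; append it to a local file
        # (output.txt is a placeholder name)
        with open("output.txt", "a") as f:
            f.write(str(item) + "\n")
        # Return the item so that any later pipeline can continue processing it
        return item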

2. Call this handler in the crawler.

To invoke this pipeline we only need to yield items from the crawler. For example, the original content handler function can be changed to:

## Remember to import the item class at the top of the spider file:
## from miao.items import ContentItem
def parse_topic(self, response):
    selector = Selector(response)
    content_list = selector.xpath("//*[@class='postcontent ubbcode']")
    for content in content_list:
        content = content.xpath('string(.)').extract_first()
        ## Everything above is the original content handling
        ## Create a ContentItem object to hold what we crawled
        item = ContentItem()
        item["url"] = response.url
        item["content"] = content
        item["author"] = ""  ## omitted
        ## This one call is all it takes:
        ## scrapy hands this item to the FilePipeline we just wrote
        yield item

3. Specify this pipeline in the configuration file

Find the settings.py file and add the following to it:

ITEM_PIPELINES = {
    'miao.pipelines.FilePipeline': 400,
}

Then, whenever the crawler calls

yield item

the item will be handled by this FilePipeline. The number 400 indicates the priority.

Multiple pipelines can be configured here. Scrapy hands items to each pipeline in order of priority, and the result of each pipeline is passed on to the next one for further processing; see the chaining sketch after the configuration example below.

You can configure multiple pipelines like this:

ITEM_PIPELINES = {
    'miao.pipelines.Pipeline00': 400,
    'miao.pipelines.Pipeline01': 401,
    'miao.pipelines.Pipeline02': 402,
    'miao.pipelines.Pipeline03': 403,
    ## ...
}
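As an illustration of chaining (the EmptyContentFilter class and its filtering rule are hypothetical, not part of the original article), an earlier pipeline can clean or drop items before later pipelines ever see them:

from scrapy.exceptions import DropItem

class EmptyContentFilter(object):
    def process_item(self, item, spider):
        # Drop items with no content; dropped items never reach later pipelines
        if not item.get("content"):
            raise DropItem("empty content: %s" % item.get("url"))
        # Returning the item passes it on to the next pipeline, e.g. FilePipeline
        return item

Register it in ITEM_PIPELINES with a smaller number than FilePipeline so that it runs first.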

Middleware

Through middleware we can modify the request information: common tasks such as setting the UA, a proxy, or login information can all be configured through middleware.

0. Configuration of Middleware

Similar to the pipeline configuration, add the middleware names to settings.py, for example:

DOWNLOADER_MIDDLEWARES = {
    "miao.middleware.UserAgentMiddleware": 401,
    "miao.middleware.ProxyMiddleware": 402,
}

1. The lousy site checks the UA, so I want to change the UA

Some sites are not accessible without UA.

Create a middleware.py under the miao folder

import random

agents = [
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
]

class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random UA from the list above and attach it to the outgoing request
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent

This is a simple middleware that randomly swaps the UA; the contents of agents can be extended as you like.

2. The lousy site blocks my IP, so I want to use a proxy

For example, if the local machine runs a proxy on port 8123 of 127.0.0.1, the crawler can use it to crawl the target website through that proxy.

Also add the following to middleware.py:

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Fill in your own proxy here.
        # If you bought proxies, use the provider's API to fetch the proxy list and pick one at random.
        proxy = "http://127.0.0.1:8123"
        request.meta["proxy"] = proxy

Many websites limit the number of visits; if the visit frequency is too high, the IP will be temporarily blocked.

If necessary, you can buy IPs online. Providers generally offer an API for fetching the currently available IP pool; just pick one and fill it in here, for example as in the sketch below.
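A sketch of what that might look like (the proxies list and the RandomProxyMiddleware name are placeholders, not from the original article; in practice, populate and refresh the list from your provider's API):

import random

# Placeholder proxy pool; in practice, fetch this list from your provider's API
proxies = [
    "http://127.0.0.1:8123",
    # "http://your.proxy.host:port",
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to each outgoing request
        request.meta["proxy"] = random.choice(proxies)

Register it in DOWNLOADER_MIDDLEWARES just like ProxyMiddleware above.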

Some common configurations

Some common settings in settings.py

# Interval, in seconds: the delay between scrapy requests
DOWNLOAD_DELAY = 5

# Whether to retry when a request fails
RETRY_ENABLED = True

# Retry when one of the following http status codes is encountered
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]

# Number of retries
RETRY_TIMES = 5

# Pipeline concurrency: how many items the pipelines may process at the same time
CONCURRENT_ITEMS = 200

# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 100

# Maximum number of concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 50

# Maximum number of concurrent requests per IP
CONCURRENT_REQUESTS_PER_IP = 50

I just want to use Pycharm

If you want to use Pycharm as your development and debugging tool, you can set up the run configuration as follows:

Configuration page:

Script: fill in the path to your scrapy's cmdline.py, for example mine is

/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py

Then fill in the name of the crawler in the Script parameters, in this case:

crawl NgaSpider

Finally, for the Working directory, find your settings.py file and fill in the directory that contains it.

Example:

[screenshot of the run configuration]

Press the little green arrow and you can debug happily.

Reference

Here is a very detailed introduction to scrapy.

http://scrapy-chs.readthedocs.io/en_GB/0.24/

The following are some of the more important parts:

Scrapy's architecture:

http://scrapy-chs.readthedocs.io/en/0.24/topics/architecture.html

xpath syntax:

http://

Pipeline configuration:

http://scrapy-chs.readthedocs.io/en/0.24/topics/item-pipeline.html

Middleware configuration:

http://scrapy-chs.readthedocs.io/en/0.24/topics/downloader-middleware.html

settings.py configuration:

http://scrapy-chs.readthedocs.io/en/0.24/topics/settings.html
