CFVideoScraper/CinemScraper/spiders/grabVideoData.py

# -*- coding: utf-8 -*-
import scrapy


class GrabvideodataSpider(scrapy.Spider):
    name = 'grabVideoData'
    allowed_domains = ['cinematheque.fr']
    start_urls = ['http://www.cinematheque.fr/decouvrir.html']

    def parse(self, response):
        for lien in response.xpath('//a/@href[contains(.,"video")]/../..'):
            url = response.urljoin(lien.css('a::attr(href)').extract_first())
            yield scrapy.Request(url, callback=self.parse_dir_content)

    def parse_dir_content(self, response):
        for page in response.css('div#content'):
            yield {
                # default='' avoids an AttributeError when the page has no <h1> text.
                'titre': page.css('h1::text').extract_first(default='').strip(),
                'sous-titre': page.css('h1 span::text').extract_first(),
                'description': page.css('.description p').extract(),
                'biographies': page.css('.biographies p').extract(),
                'videoSrcUrl': page.css('iframe::attr(src)').re_first(r'\w[\w\.\/]+'),
                'articleUrl': response.url,
                # Tag labels with surrounding newlines and spaces removed; empty nodes dropped.
                'tags': [tag.strip() for tag in page.css('.tag::text').extract() if tag.strip()],
            }
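
The `videoSrcUrl` field relies on the regex `\w[\w\.\/]+`, and what it returns depends on the form of the iframe `src`. The sketch below applies the same pattern with the standard `re` module to two sample values. The `first_match` helper and both sample URLs are hypothetical and are not taken from the site; the helper only approximates what `re_first` does for a pattern without capture groups.

```python
import re

# Pattern copied from the spider's 'videoSrcUrl' field.
VIDEO_SRC_PATTERN = re.compile(r'\w[\w\.\/]+')


def first_match(pattern, text):
    """Return the first full match of pattern in text, or None if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else None


# Hypothetical iframe src values (assumptions, not real data from the site).
protocol_relative = '//player.example.com/video/12345'
absolute = 'https://player.example.com/video/12345'

# The leading '//' is skipped because the match must start on a word character.
print(first_match(VIDEO_SRC_PATTERN, protocol_relative))  # player.example.com/video/12345

# ':' is not in the character class, so the match stops after the scheme.
print(first_match(VIDEO_SRC_PATTERN, absolute))  # https
```

If the pages ever embed absolute `src` URLs, the pattern would presumably need to allow `:` or the field could store the raw attribute instead.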

Commit history (from the blame view):

2018-05-10 17:31:59 +00:00  First commit
    (encoding line, `import scrapy`, class definition, `name`, `def parse`, `for page in response.css("div#content")`)
2018-05-11 23:21:56 +00:00  Link scraping and data extraction in the same spider
    (`allowed_domains`, `start_urls`, body of `parse`, `def parse_dir_content`, `yield { ... }`)
2018-05-12 02:06:52 +00:00  Add the URL of the descriptive post associated with the video
    (`'articleUrl'`)
2018-05-21 03:15:17 +00:00  Remove line breaks and spaces at the start and end of strings
    (`'titre'`, `'sous-titre'`, `'description'`, `'biographies'`, `'videoSrcUrl'`, `'tags'`)