Penny Arcade Scraper Fix

Bug #482055 reported by Ged Walsh
This bug affects 1 person
Affects: Dosage
Status: Fix Committed
Importance: Medium
Assigned to: Tristan Seligmann
Milestone: 1.7.0

Bug Description

Penny Arcade (http://www.penny-arcade.com/comic/) changed its image URLs and its "previous" links, so the scraper's patterns no longer match. Updated scraper:

class PennyArcade(BasicScraper):
    latestUrl = 'http://www.penny-arcade.com/comic/'
    imageUrl = 'http://www.penny-arcade.com/comic/%s/'
    imageSearch = compile(r'(?<!<!--)<img src="(http://art.penny-arcade.com/photos/.+?)"')
    prevSearch = compile(r'<a href="(/comic/.+?)/">Back</a>')
    help = 'Index format: yyyy/mm/dd'
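
As a rough check of the two patterns above, the sketch below runs them against a made-up page fragment. The HTML in sample_html is hypothetical and was not taken from the real site; only the two regular expressions come from this report.

```python
import re

# Patterns copied from the scraper definition in this report.
image_search = re.compile(r'(?<!<!--)<img src="(http://art.penny-arcade.com/photos/.+?)"')
prev_search = re.compile(r'<a href="(/comic/.+?)/">Back</a>')

# Hypothetical page fragment, invented for illustration only.
sample_html = (
    '<!--<img src="http://art.penny-arcade.com/photos/old.jpg">-->'
    '<img src="http://art.penny-arcade.com/photos/example.jpg" alt="strip">'
    '<a href="/comic/2009/11/11/">Back</a>'
)

# The negative lookbehind skips an <img> that directly follows "<!--".
image_url = image_search.search(sample_html).group(1)

# The captured path drops the trailing slash of the "Back" link.
prev_path = prev_search.search(sample_html).group(1)

print(image_url)
print(prev_path)
```

Note that the lookbehind only rejects an image tag that immediately follows the comment opener; a tag further inside a comment would still match.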

Tags: comic

Ged Walsh (bleedingheart) wrote :

Better fix.

class PennyArcade(BasicScraper):
    imageUrl = 'http://www.penny-arcade.com/comic/%s/'
    imageSearch = compile(r'(?<!<!--)<img src="(http://art.penny-arcade.com/photos/.+?)"')
    prevSearch = compile(r'<a href="(/comic/.+?)/">Back</a>')
    help = 'Index format: yyyy/mm/dd'
    starter = indirectStarter('http://www.penny-arcade.com/comic/', compile(r'<a href="(/comic/.+?)/">Back</a>'))

    def namer(cls, imageUrl, pageUrl):
        # Name each strip after the date at the end of its page URL (.../comic/yyyy/mm/dd).
        year = pageUrl.split('/')[-3]
        month = pageUrl.split('/')[-2]
        day = pageUrl.split('/')[-1]
        return '%s-%s-%s' % (year, month, day)

This deliberately skips the latest strip so that every strip is named after the date in its page URL and the files sort in order.
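
A minimal standalone sketch of the naming logic, for illustration. The function name name_strip and the example URLs are assumptions made here; inside Dosage the logic lives in the namer method shown above.

```python
def name_strip(page_url):
    """Build a file name from the last three path components of a page URL."""
    parts = page_url.split('/')
    return '%s-%s-%s' % (parts[-3], parts[-2], parts[-1])


# Hypothetical archive page URLs, in the shape the "Back" link pattern yields
# (no trailing slash).
pages = [
    'http://www.penny-arcade.com/comic/2009/11/13',
    'http://www.penny-arcade.com/comic/2009/11/09',
    'http://www.penny-arcade.com/comic/2009/11/11',
]

names = sorted(name_strip(url) for url in pages)

# The front page URL carries no date, so the same logic gives no usable name.
latest_name = name_strip('http://www.penny-arcade.com/comic/')

print(names)
print(latest_name)
```

Because the names start with the year, then month, then day, a plain alphabetical sort matches publication order, which appears to be the reason for starting from the first dated page instead of the front page.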

summary: Penny Arcade Module Fix → Penny Arcade Scraper Fix
Changed in dosage:
assignee: Jonathan Jacobs (jjacobs) → Tristan Seligmann (mithrandi)
importance: Unknown → Medium
tags: added: comic
Changed in dosage:
milestone: none → 1.7.0
Changed in dosage:
status: New → Fix Committed