
Scrapinghub and GSoC 2018

At Scrapinghub, we love open source and we know the community can build amazing things.

If you haven’t heard about it already, Google Summer of Code is a global program that offers students stipends to write code for open source projects. Scrapinghub is applying to GSoC for the 5th time, having participated in GSoC 2014, 2015, 2016 and 2017. Julia Medina, our student in 2014, did amazing work on Scrapy’s API and settings. Jakob de Maeyer, our student in 2015, did a great job getting Scrapy Addons off the ground.

If you’re interested in participating in GSoC 2018 as a student, take a look at the curated list of ideas below. Check the corresponding “Information for Students” section and get in touch with the mentors. Don’t be afraid, we’re nice people :)

We would be thrilled to see any of the ideas below happen, but these are just our ideas; you are free to come up with a new subject, preferably one around information retrieval :)

Let’s make it a great Google Summer of Code!

Scrapy Ideas for GSoC 2018

Scrapy and Google Summer of Code

Scrapy is a very popular web crawling and scraping framework for Python (15th among the most trending Python projects on GitHub), used to write spiders that crawl and extract data from websites. Scrapy has a healthy and active community, and it’s applying for Google Summer of Code in 2018.

Information for Students

If you’re interested in participating in GSoC 2018 as a student, you should join the scrapy-users mailing list and post your questions and ideas there. You can also join the #scrapy IRC channel on Freenode to chat with other Scrapy users and developers. All Scrapy development happens in the Scrapy repo on GitHub.

Ideas

New HTTP/2 download handler

Brief explanation

Develop an HTTP/2-compatible download handler, future-proofing Scrapy and possibly accelerating it.

Expected Results

An HTTP download handler that can gracefully upgrade to HTTP/2 where possible, taking advantage of the compression and efficiency gains of the new protocol.
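
For orientation, custom download handlers are wired into Scrapy through the DOWNLOAD_HANDLERS setting and expose a download_request() entry point. The sketch below shows roughly where HTTP/2 support would slot in; the class name, module path and method body are hypothetical placeholders, not a working implementation:

```python
from twisted.internet import defer

class H2DownloadHandler(object):
    """Hypothetical skeleton: Scrapy routes every request whose URL
    scheme is mapped to this class in DOWNLOAD_HANDLERS through
    download_request()."""

    def __init__(self, settings):
        self.settings = settings

    def download_request(self, request, spider):
        # Negotiate HTTP/2 where the server supports it, falling back
        # to HTTP/1.1 otherwise -- the core task of this project.
        return defer.fail(NotImplementedError("HTTP/2 negotiation goes here"))

# settings.py -- the module path is a placeholder:
DOWNLOAD_HANDLERS = {
    'https': 'myproject.handlers.H2DownloadHandler',
}
```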

Required skills: Python, HTTP protocol
Difficulty level: Advanced
Mentor(s): Daniel, Konstantin

Async & Await syntax support in Spiders

Brief explanation

Python now has syntax-level support for coroutine-based concurrent programming via the “async” and “await” keywords. We would like Scrapy to be able to make use of coroutine-based programming, and to take advantage of asyncio and other modern async frameworks should we migrate away from Twisted at a future date.
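
To make the goal concrete, here is a rough sketch of what a coroutine-based callback could look like once such support lands. None of this works in Scrapy today; the async parse() and the inline-await helper are exactly the kind of thing this project would design:

```python
import scrapy

class BookSpider(scrapy.Spider):
    """Hypothetical future syntax, for illustration only."""
    name = 'books'
    start_urls = ['http://example.com/catalogue']

    async def parse(self, response):
        for href in response.css('a.book::attr(href)').extract():
            # Await the follow-up request inline instead of yielding a
            # Request object with a separate callback method.
            detail = await response.follow_async(href)  # hypothetical API
            yield {'title': detail.css('h1::text').extract_first()}
```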

Expected Results

Modifications to Scrapy that permit the use of Python’s native asynchronous keywords and features, without breaking backwards compatibility with other still-supported versions of Python.

Required skills: Python, asyncio
Difficulty level: Intermediate
Mentor(s): Daniel, Cathal

Scrapy performance improvement

Brief explanation

We have a benchmarking suite with which we can profile portions of the Scrapy core. We propose a project that would deliver measurable improvements to the performance of Scrapy’s individual parts, preferably those that are currently a bottleneck to Scrapy’s overall performance.
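
Whichever component a student picks, the workflow is the standard one: profile first, optimise second. A generic recipe using the standard library’s cProfile (the hot spot below is a stand-in, not an actual Scrapy bottleneck):

```python
import cProfile
import pstats

def candidate_hotspot():
    # Stand-in for the component under investigation, e.g. request
    # fingerprinting or the scheduler's duplicate filter.
    from hashlib import sha1
    for i in range(100000):
        sha1(str(i).encode()).hexdigest()

# Record a profile, then print the ten biggest cumulative-time offenders.
cProfile.run('candidate_hotspot()', 'bench.prof')
pstats.Stats('bench.prof').sort_stats('cumulative').print_stats(10)
```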

Expected Results

Measurable performance improvements to Scrapy’s components, preferably improvements that filter up into Scrapy’s overall performance.

Required skills: Python, Profiling, Algorithms, Data Structures
Difficulty level: Intermediate
Mentor(s): Konstantin Lopukhin, Cathal

Scrapy spider autorepair

Brief explanation

Spiders can break due to changes on the target site that lead to different page layouts (and, therefore, broken XPath and CSS extractors). Often, however, the information content of a page remains the same, just in a different form or layout. This project would use snapshotted versions of a target page, combined with the data previously extracted from that page, to infer rules for scraping the new layout automatically. Scrapely is an example of a pre-existing tool that might be instrumental in this project.
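
For reference, Scrapely’s basic flow is to learn an extraction template from one annotated page and apply it to similarly structured pages, which is close to the situation this project assumes (an old snapshot plus its known-good extracted data). The URLs and values below are illustrative only:

```python
from scrapely import Scraper

scraper = Scraper()

# Train on a snapshot whose correct output we already know.
scraper.train('http://example.com/product/1',
              {'name': 'Example product', 'price': '9.50'})

# Apply the learned template to a page whose layout has changed.
print(scraper.scrape('http://example.com/product/2'))
```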

Expected Results

A tool that can, in some fortunate cases, automatically infer extraction rules to keep a spider up to date with site changes. Preferably, these rules would be emitted as new XPath or CSS queries in the log output, so that users can incorporate them into the spider for a more maintainable long-term fix.

Required skills: Python, Algorithms, Testing, Data Structures
Difficulty level: Advanced
Mentor(s): Cathal, Artur

Portia Ideas for GSoC 2018

Information for Students

If you’re interested in participating in GSoC 2018 as a student by contributing to Portia, you should join the portia-scraper mailing list and post your questions and ideas there. All Portia development happens in the Portia repo on GitHub.

Ideas

Increase Crawling Performance through page clustering

Brief explanation

With the rise of Angular and React, web crawling increasingly requires tools like Selenium, PhantomJS and Splash to render pages so that data can be extracted. Rendering pages like this can make a crawl take 10 times as long to complete. This project aims to use page clustering: examine pages before and after rendering, and build up rules, based on the available data and links, to decide whether other similar pages need to be rendered at all.
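
One way to picture the decision rule is the minimal sketch below; the function names, the link-set comparison and the threshold are all assumptions for illustration, not a design commitment:

```python
def render_gain(static_links, rendered_links):
    """Fraction of links that only appear after JS rendering."""
    static, rendered = set(static_links), set(rendered_links)
    if not rendered:
        return 0.0
    return len(rendered - static) / len(rendered)

def should_render(cluster_samples, threshold=0.05):
    """Per-cluster decision: render future pages only if the sampled
    pages gained a meaningful number of links from rendering.

    cluster_samples: list of (static_links, rendered_links) pairs taken
    from pages already assigned to this cluster.
    """
    gains = [render_gain(s, r) for s, r in cluster_samples]
    return sum(gains) / len(gains) > threshold
```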

Expected Results

It should be possible to reduce the number of pages rendered for some crawls, reducing the time and bandwidth needed to extract data from a site.

Required skills: Python, Clustering
Difficulty level: Intermediate
Mentor(s): Ruairi Fahy, Cathal Garvey

Splash Ideas for GSoC 2018

Information for Students

Splash doesn’t yet have a mailing list, so if you’re interested in discussing any of these ideas, drop us a line via email at gsoc@scrapinghub.com, or open an issue on GitHub. You can also check the documentation at https://splash.readthedocs.org/en/latest/.

All Splash development happens in the Splash repo on GitHub.

Ideas