    Repositories list

    • [UNMAINTAINED] S3 Uploader pipelines for HTML and screenshots rendered by Splash
      Python
      1140 Updated Mar 19, 2026
    • Formasaurus tells you the type of an HTML form and its fields using machine learning
      HTML
      45121150 Updated Mar 19, 2026
    • Middleware that limits the number of internal/external links during a broad crawl
      Python
      2240 Updated Mar 19, 2026
    • A library to collect data from search forms
      Python
      3040 Updated Mar 19, 2026
    • autologin

      Public
      A project that attempts to automatically log in to a website given a single seed
      Python
      Apache License 2.0
      41129135 Updated Mar 19, 2026
    • [UNMAINTAINED] A middleware that provides continuous site login facility
      Python
      1440 Updated Feb 23, 2026
    • Simple heuristic for measuring web page similarity (& data set)
      HTML
      169140 Updated Feb 23, 2026
    • Website for TREC Dynamic Domain
      HTML
      MIT License
      5100 Updated Feb 23, 2026
    • fortia

      Public
      [UNMAINTAINED] Firefox addon for Scrapely
      JavaScript
      4560 Updated Feb 23, 2026
    • [UNMAINTAINED] Deploy, run and monitor your Scrapy spiders.
      Python
      Apache License 2.0
      121230 Updated Feb 23, 2026
    • Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
      Python
      Other
      212200 Updated Feb 23, 2026
    • Datawake

      Public
      Browser extension and backend services aimed at enhancing Internet search with domain specific knowledge, collaboration, and analysis.
      JavaScript
      Apache License 2.0
      5500 Updated Feb 23, 2026
    • A collection of example Lua scripts and JS utilities
      JavaScript
      3740 Updated Feb 23, 2026
    • sitehound

      Public
      This is the facade for installation and access to the individual components
      Shell
      Apache License 2.0
      61630 Updated Feb 10, 2026
    • Scrapy extension which writes crawled items to Kafka
      Python
      MIT License
      83150 Updated Feb 10, 2026
    • privoxy

      Public
      Privoxy HTTP Proxy based on jess/privoxy
      Dockerfile
      Other
      2730 Updated Feb 10, 2026
    • Show summary of a large number of URLs in a Jupyter Notebook
      Python
      MIT License
      71930 Updated Feb 10, 2026
    • Scrapy middleware that reads proxy config from settings
      Python
      MIT License
      4430 Updated Feb 10, 2026
    • tor-proxy

      Public
      A Tor SOCKS proxy Docker image
      Dockerfile
      Other
      41230 Updated Feb 10, 2026
    • Sitehound's backend
      HTML
      Apache License 2.0
      4730 Updated Feb 10, 2026
    • Site Hound (previously THH) is a Domain Discovery Tool
      HTML
      Apache License 2.0
      92451 Updated Feb 10, 2026
    • A list of memex-related tools and their repository URLs
      MIT License
      57600 Updated Feb 10, 2026
    • Scrapy middleware that allows crawling only new content
      Python
      MIT License
      218072 Updated Feb 10, 2026
    • fuzzyset

      Public
      A simple fuzzy matching set for Python strings
      Python
      48100 Updated Feb 10, 2026
    • Extract the difference between two HTML pages
      HTML
      MIT License
      53350 Updated Feb 10, 2026
    • Use multiple proxies with Scrapy
      Python
      MIT License
      158774485 Updated Feb 10, 2026
    • linkdepth

      Public
      [UNMAINTAINED] Scrapy spider to check link depth over time
      Python
      1430 Updated Feb 10, 2026
    • memex-cdr

      Public
      This repository hosts code and schema information related to the Memex Crawl Data Repository (CDR)
      Python
      8100 Updated Feb 10, 2026
    • Log TensorBoard events without touching TensorFlow
      Python
      MIT License
      48629121 Updated Feb 10, 2026
    • THH ↔ deep-deep integration
      Python
      MIT License
      2320 Updated Feb 10, 2026