
[ADR-0002] P3 Archive elastic search integration

Regarding the P3 data export, crawling the P3 site and scraping P3 data looks like the easiest option compared to fetching data from the P3 database.


Last updated 4 years ago

  • Status: accepted

  • Deciders: Engineering Team

Technical Story:

  • The previous Greenpeace website application (P3) will be decommissioned soon (around the end of March 2020).

  • We still want this content to remain accessible by archiving it and making that archive searchable from the P4 websites.

  • Its content has already been archived to the Internet Archive (http://web.archive.org/), which has search functionality.

  • However, the archive search API returns only 100 results per request, and each request takes almost 10 seconds. It also has a single collection for all NROs.

  • Because all P3 data sits in that one archive collection (we proposed one collection per NRO, but that turned out not to be feasible), a search of the archive returns results across all P3 NRO data, only 100 records at a time.
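
To make the cost of these limits concrete, here is a rough back-of-the-envelope sketch. It assumes results could be fetched in successive requests of 100 records at about 10 seconds each, which is an assumption about how the archive API would be used rather than a documented behaviour. The function name and the record count in the usage line are hypothetical.

```python
import math


def estimated_fetch_seconds(total_records, page_size=100, seconds_per_request=10.0):
    """Estimate the time needed to read `total_records` from the archive search API.

    page_size and seconds_per_request default to the figures quoted above
    (100 results per request, almost 10 seconds per request).
    """
    requests_needed = math.ceil(total_records / page_size)
    return requests_needed * seconds_per_request


# Hypothetical example: a search matching 5000 records across all NROs.
print(estimated_fetch_seconds(5000))
```

Under these assumptions the wait grows linearly with the number of matching records, which is far too slow for an interactive search page.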

Decision Drivers

  • Tight timeline. P3 will be decommissioned at the end of March (Akamai) and the data center will be closed at the end of May.

  • Making sure data is indexed in a persistent way.

  • Making sure the archive search doesn’t slow down P4 search.

Considered Options

Use ElasticSearch to index the minimum dataset in order to have results displayed on P4 search. Results will be linked to the Internet Archive for the actual content. This requires two steps:

  1. Export the P3 data (only a few fields such as title, description and date) into XML files.

  2. Import the P3 data into an archive ElasticSearch index and query that index instead of the archive search API.
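
The import and query steps could look roughly like the following sketch. It assumes an XML layout of `<posts nro="..."><post>...</post></posts>`, an index called `p3-archive`, and the field names `title`, `description` and `nro`; all of these are hypothetical and not taken from the actual implementation. The first function builds a request body in the newline-delimited format that the ElasticSearch bulk API expects, and the second builds a search body that restricts results to a single NRO. Sending either one to a cluster is left out.

```python
import json
import xml.etree.ElementTree as ET

INDEX_NAME = "p3-archive"  # hypothetical index name


def xml_to_bulk_body(xml_text, index=INDEX_NAME):
    """Turn one exported XML file into an ElasticSearch bulk request body.

    Assumes a hypothetical layout: <posts nro="..."><post><title/>...</post></posts>.
    """
    root = ET.fromstring(xml_text)
    nro = root.get("nro", "")
    lines = []
    for post in root.iter("post"):
        # Every child element of <post> becomes one field of the document.
        doc = {child.tag: (child.text or "") for child in post}
        doc["nro"] = nro
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))  # document source line
    # The bulk format is newline-delimited and must end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


def build_search_query(term, nro, size=10, offset=0):
    """Build a search body limited to one NRO.

    Assumes `nro` is mapped as a keyword field so the term filter matches exactly.
    """
    return {
        "from": offset,
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {"query": term, "fields": ["title", "description"]}
                },
                "filter": {"term": {"nro": nro}},
            }
        },
    }
```

Storing the NRO on each document is one way to get the per-NRO separation that the single archive collection could not offer, since the filter clause narrows results without affecting relevance scoring.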

Exporting can be done with one of these three options:

  1. Get the post index from the Archive team (the Archive team has not responded to this request).

  2. Retrieve the data from the P3 database (the P3 database structure is unclear, and database access is available over VPN only).

  3. Crawl the P3 sites and scrape the required P3 data into XML files (one NRO at a time).
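
As an illustration of the third option, the sketch below extracts a few fields from already-downloaded HTML pages and writes them into one XML document per NRO. It assumes that the title sits in `<title>` and that description and date are exposed as `<meta name="description">` and `<meta name="date">` tags; the real P3 markup may well differ. The class and function names and the XML layout are hypothetical, and the crawling itself (discovering and fetching URLs) is deliberately left out.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class PageFieldParser(HTMLParser):
    """Collect the <title> text and selected <meta> tags from one HTML page."""

    # Hypothetical meta tag names; adjust to the actual P3 markup.
    META_FIELDS = ("description", "date")

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") in self.META_FIELDS:
            self.fields[attrs["name"]] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] = self.fields.get("title", "") + data


def scrape_page(url, html):
    """Return the scraped fields of one page, plus its URL."""
    parser = PageFieldParser()
    parser.feed(html)
    parser.close()
    fields = {name: value.strip() for name, value in parser.fields.items()}
    fields["url"] = url
    return fields


def pages_to_xml(nro, pages):
    """Build one XML document for an NRO from (url, html) pairs."""
    root = ET.Element("posts", {"nro": nro})
    for url, html in pages:
        post = ET.SubElement(root, "post")
        for name, value in scrape_page(url, html).items():
            ET.SubElement(post, name).text = value
    return ET.tostring(root, encoding="unicode")
```

Keeping the original URL on each record matters because the search results are meant to link out to the archived copy of the page rather than hold the full content.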

Decision Outcome

Regarding the P3 data export, crawling the P3 sites and scraping the P3 data (the 3rd option) looks like the easiest option compared to fetching the data from the P3 database.

Links

  • Related JIRA issue: PLANET-4717

  • Handover doc by Dylan

  • Infra ticket: PLANET-4823

  • http://web.archive.org/