User:Jrubashk/ScraperWork

  1. Access the scraper: https://public-paws.wmcloud.org/70821524/PodcastData/
  2. Log in at https://hub-paws.wmcloud.org/hub/login?next=%2Fhub%2F
  3. Use your normal Wikidata/Wikipedia/Wikimedia login.
  4. Use the third section of the JupyterLab (PAWS) interface. Make sure to add a folder for the project you’re working on.
  5. Copy and paste the Python code from the scraper page into a Python file, and save that file in the designated folder. This file will be used when running the scraper.
    1. Look within the code to see what information needs to be inserted for individual pages; in this case it is the Apple Podcast IDs. Once you are satisfied with the tags/information the script will save, the next step is capturing the HTML. (A sketch of where these per-podcast values sit follows this list.)
  6. Open the Mozilla Firefox web browser and go to the given podcast’s Apple Podcasts page. Load however many episodes you’d like the scraper to capture; if this step is skipped, only the initial, most recent episodes will be captured. Once the desired number of episodes is visible, open inspector mode by right-clicking on the page. When the HTML is visible, right-click on the first or last line of the HTML code and copy the outer HTML.
  7. Paste the outer HTML into a text file within PAWS. After saving this txt file, insert the file name into the Python script (see the parsing sketch after this list).
  8. Run the Python script in the terminal. The line will be: python [Python file name]
    1. The file name should end in .py.
    2. Make sure the Python file has the accurate Apple Podcast ID. If you forgot to alter it in the Python script, you can change it in the terminal after running the initial step.
    3. If the generated CSV file name is not changed, the script will append to the current episode list instead of creating something new (see the append-mode sketch after this list).
    4. It will take a few moments for the Python script to create the corresponding CSV and JSON files.
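
For step 5.1: the scraper script itself isn't reproduced here, but per-podcast values like the Apple Podcast ID usually sit near the top of such a script. A minimal sketch, where every name is a hypothetical placeholder rather than the actual script's variables:

    # Hypothetical sketch only: the real scraper's variable names will differ.
    PODCAST_ID = "id1234567890"     # Apple Podcast ID to change per show (step 5.1)
    HTML_FILE = "podcast_page.txt"  # text file holding the pasted outer HTML (step 7)
    CSV_FILE = "episodes.csv"       # rename per run to avoid appending (step 8.3)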
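
For step 7: once the outer HTML is saved as a text file, the script reads and parses it. A minimal sketch of that parsing step, assuming the BeautifulSoup library is available in the PAWS environment; the selector here is a generic placeholder, not the real scraper's:

    from bs4 import BeautifulSoup  # assumed to be installed in PAWS

    HTML_FILE = "podcast_page.txt"  # hypothetical name; match the txt file you saved

    with open(HTML_FILE, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # Placeholder extraction: print every link found in the saved page.
    # The real scraper targets Apple's episode markup instead.
    for link in soup.find_all("a", href=True):
        print(link["href"])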
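
For step 8.3: the append behavior comes from opening the CSV in Python's append mode ("a"), which adds rows to an existing file instead of replacing it. A small illustration with hypothetical file and column names:

    import csv
    import os

    CSV_FILE = "episodes.csv"  # hypothetical name; reuse it and the episode list grows

    new_file = not os.path.exists(CSV_FILE)
    with open(CSV_FILE, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["episode_title", "episode_url"])  # header only once
        writer.writerow(["Example episode", "https://example.org/episode-1"])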