Help with a CI workflow!

Hi Friends - we have this issue that has been open for some time. It relates to running HTMLProofer only on the files that already exist online. Right now, if we create a new page on our website, guides, etc., HTMLProofer fails the build because it can’t find the link online (hence it’s a broken link) - but really it’s just a link that doesn’t exist YET.

If anyone has time to help with this issue I’d greatly appreciate it. There are instructions for implementing it in the HTMLProofer docs, but I haven’t had time to figure out how to implement it.

And I did ask ChatGPT, and it provided this solution, which looks like it might work - but I am worried it will still fail because it’s running HTMLProofer first, before identifying new pages and running it again (I think??)

If anyone has time to help us with this I’d greatly appreciate it!!

      - name: Check HTML using htmlproofer
        id: htmlproofer
        uses: chabad360/htmlproofer@master
        with:
          directory: "_site"
          arguments: |
            --external_only
            --ignore-urls "https://fonts.googleapis.com,https://fonts.gstatic.com,_site/_posts/README/index.html"
            --ignore-files "/.+\/_posts\/README.md"
            --ignore-status-codes "403, 503, 999"

      - name: Skip HTMLProofer for new pages
        run: |
          for file in $(git diff --name-only HEAD~1 HEAD | grep '^_site/.*\.html$'); do
            url=$(echo $file | sed 's/_site\(.*\)\.html/\1/')
            echo "Skipping HTMLProofer check for new page: $url"
            sed -i "/$url/d" $GITHUB_WORKSPACE/$id/htmlproofer_args.txt
          done
        env:
          id: htmlproofer

      - name: Re-run HTMLProofer after skipping new pages
        run: |
          cat $GITHUB_WORKSPACE/$id/htmlproofer_args.txt | xargs -0 chabad360/htmlproofer@master
        env:
          id: htmlproofer

Hi! I found this through Mastodon and thought I might be able to chip in something useful.

So, first of all, I believe you’re right that ChatGPT is tripping and what it suggested will not work. If that code came from a human, I would guess that the human was trying to run HTMLProofer the first time to generate a list of file names to be processed and store that list in htmlproofer/htmlproofer_args.txt, then remove any files that were changed in the most recent commit from that list, and then run HTMLProofer on the trimmed-down list of filenames. But that hypothetical human would have done an extremely bad job of it: like you said, it’s going to fail on the first step, and the rest of the code is broken too - nothing ever writes htmlproofer_args.txt, chabad360/htmlproofer@master is an action reference rather than a command that xargs could run, and env: id: htmlproofer just sets an environment variable named id rather than referring back to the first step.

Now, as far as what will work… I’m not sure I understand the situation, so I might be very wrong about this, but it sounds like you need to get a list of files that were added in the current pull request (assuming this is running in the context of a pull request), convert those file paths to URLs, and ignore those URLs. To get the added files, you could try this action or this one - neither is something I’m personally familiar with, but I did some searches and they showed up near the top :sweat_smile: (Alternatively, you can get the list with a plain Git command - there’s a rough sketch of that after the example below.) Then you’d have to convert the file paths to URLs, join them with commas, and finally pass that string as the argument to HTMLProofer’s --ignore-urls option. I imagine it winds up looking something like this:

    steps:
      - id: files
        name: Get changed files
        uses: jitterbit/get-changed-files@v1
        with:
          format: json
      - id: urls
        name: Construct comma-separated list of URLs to ignore
        run: |
          # `with:` values aren't run through a shell, so publish the list as a
          # step output instead of trying to expand a file there.
          echo "list=$(jq -r 'convert_to_url | join(",")' <<<'${{ steps.files.outputs.added }}')" >>"$GITHUB_OUTPUT"
      - name: Check HTML using htmlproofer
        uses: chabad360/htmlproofer@master
        with:
          directory: "_site"
          arguments: |
            --ignore-urls "https://fonts.googleapis.com,https://fonts.gstatic.com,_site/_posts/README/index.html,${{ steps.urls.outputs.list }}"
            --ignore-files "/.+\/_posts\/README.md"
            --ignore-status-codes "403, 503, 999"
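
As a side note on that first step: if you’d rather not depend on a third-party action, you could get the added files with a plain Git command instead, which is the alternative I mentioned above. A rough sketch, assuming this runs on pull_request events and that actions/checkout was given fetch-depth: 0 so the base branch is available:

      - name: Get added files with plain Git
        run: |
          # --diff-filter=A lists only files that were added (not modified or
          # deleted) relative to the pull request's base branch.
          git fetch origin "${{ github.base_ref }}"
          git diff --name-only --diff-filter=A "origin/${{ github.base_ref }}...HEAD" >added-files.txt

(The jq step would then need a small tweak to read plain lines from added-files.txt instead of the action’s JSON output, but the idea is the same.)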

You would of course have to replace convert_to_url in the jq program with something that actually turns the files into URLs. If you don’t know jq then I’d be happy to help “translate” a plain-language description into the right piece of code, when I have time (maybe later this week or next weekend). Or you could do it with something other than jq.
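
Just to make convert_to_url less abstract, here’s a guess at what it could look like. This assumes - and it’s pure assumption, since I haven’t looked at your site - that a source file like guides/foo.md gets served at https://example.com/guides/foo/, with example.com standing in for your real domain:

    # Hypothetical stand-in for convert_to_url: turn the JSON array of added
    # file paths into an array of the URLs those pages will be served at.
    [ .[]
      | select(endswith(".md") or endswith(".html"))
      | sub("\\.(md|html)$"; "/")
      | "https://example.com/" + . ]

That slots into the jq program in place of convert_to_url, and the join(",") after it collapses the array into the comma-separated string that --ignore-urls expects.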

Feel free to continue this on the issue if that’d be better. I subscribed to get notifications there so I should see any activity on it.


Hey there @diazona!! Gosh, thank you for this post!
Let me clarify what I need to have implemented. Essentially we need HTMLProofer to ignore NEW files/URLs when it checks links. This is because a new link will NEVER be online and will always fail HTMLProofer.

This section in the “documentation” of HTMLProofer provides a way to do this. But we are using the htmlproofer action. I suppose another way to do this would be to install HTMLProofer and use the approach in that README?

Does that make sense / help clarify? So I do think you’re on the right track in that we’d get a list of new files, parse the URLs of those files, and ignore them.

ChatGPT does really like to make things up :laughing:


Gotcha, thanks for clarifying! I think what you’re describing lines up with my understanding of the problem, as long as each URL corresponds to a file in the repository. (I haven’t actually looked at the repository yet.)

That section in the HTMLProofer README file is a little Ruby script that, as far as I can tell, does roughly the same thing I was describing in my post. (I don’t properly know Ruby, but I’ve seen enough to more or less follow what it’s doing.) If you wanted to use that approach, then yeah, you’d have to stop using the htmlproofer action and make your own action that runs the script - and you’d still have to modify the script to use the right commit IDs and such for your situation. I’d guess that’s more work than just passing a list of URLs on the command line. (At least, it is for someone such as myself who doesn’t know Ruby!)
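
For completeness, if you did go the script route, I’d imagine the workflow step ends up shaped roughly like this, where check-links.rb is just a placeholder name for your adaptation of the README’s script:

      - name: Check HTML with a custom HTMLProofer script
        run: |
          # Install the html-proofer gem, then run the adapted script (which
          # you'd have modified to compare the right commits for your setup).
          gem install html-proofer
          ruby check-links.rb

You’d also need Ruby available on the runner first, e.g. via the ruby/setup-ruby action.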

> ChatGPT does really like to make things up :laughing:

lol yup… I always remind myself that it is just fancy autocompletion with no real understanding of what it’s doing