2022-08-31

How to Convert HTML to Clean Markdown With Pandoc

I have collected source material in the form of HTML pages that I would like to keep in one place as a knowledge base (technically: create an Obsidian vault for these pages). But first I needed to convert them all to Markdown. The first tool that came to mind was Pandoc.

I started with the basic syntax to establish a baseline:

pandoc index.html -o index.md

The text was converted to Markdown, but the output had some additional elements that I did not want to see in the document:

  • remnants of div elements and iframes
  • references in curly brackets
  • raw HTML comments wrapped in code-fence blocks

Here is an example of the clutter in my Markdown document.

::: iframe

::: site-container
::: site-header
::: wrap
::: title-area
[Page Title](../../../index.html)

::: {.widget-area .header-widget-area}
::: {#nav_menu-27 .section .widget .widget_nav_menu}
::: widget-wrap
-   [[Articles](../../../blog/index.html)]{#menu-item-2360}
-   [[Books ](../../index.html)]{#menu-item-8729}

::: site-inner
::: content-sidebar-wrap
::: {.content role="main"}
::: entry-header

-   "Happiness doesn't just flow from success, it actually causes it".

<!-- -->

Using the --standalone (-s) mode adds YAML front matter with metadata, which contains a few fields that I want to keep.

My final solution

To remove the parts that remained in the Markdown document, you can use grep and sed:

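# Convert, then drop div markers, code-fence lines, and empty HTML comments,
# and finally strip {...} attribute blocks (even across line breaks).
# In zsh with noclobber set, use >! instead of > to force overwriting.
pandoc -s index.html -t markdown |\
grep -v "^:" |\
grep -v '^```' |\
grep -v '<!-- -->' |\
sed -e ':again' -e N -e '$!b again' -e 's/{[^}]*}//g' \
> index.md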

The sed command first joins all input lines into a single buffer, so the substitution can remove content in curly brackets even when it spans multiple lines:

# Linux (GNU sed)
sed ':again;$!N;$!b again; s/{[^}]*}//g' file

# macOS (BSD sed)
sed -e ':again' -e N -e '$!b again' -e 's/{[^}]*}//g' file

Solution by John1024, from Linux Stack Exchange.
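If you run this cleanup often, the filters can be wrapped in a small shell function. The following is only a sketch: the name clean_markdown is hypothetical, the three grep calls are merged into one with -e, the code-fence pattern is written with an interval so the snippet contains no literal fence, and sed uses the $!N form so that single-line input is handled as well.

```shell
# Hypothetical helper (not part of pandoc): strip conversion clutter
# from Markdown read on stdin and write the result to stdout.
clean_markdown() {
  # Drop lines that start with ":" (fenced divs), lines that start with
  # three backticks (code fences), and lines containing "<!-- -->".
  grep -v -e '^:' -e '^`\{3\}' -e '<!-- -->' |
    # Join all lines into one buffer, then delete every {...} block,
    # including blocks that span a line break.
    sed -e ':again' -e '$!N' -e '$!b again' -e 's/{[^}]*}//g'
}
```

It can then be used in a pipeline such as pandoc -s index.html -t markdown | clean_markdown > index.md, or inside a loop over all *.html files in a directory. Note that the curly-bracket rule is blunt: it also removes braces that are part of the real content, for example in code samples.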


You can further experiment with Markdown variants supported by pandoc.

In addition to pandoc’s extended Markdown, the following Markdown variants are supported:

  • markdown_phpextra (PHP Markdown Extra)
  • markdown_github (deprecated GitHub-Flavored Markdown)
  • markdown_mmd (MultiMarkdown)
  • markdown_strict (Markdown.pl)
  • commonmark (CommonMark)
  • gfm (GitHub-Flavored Markdown)
  • commonmark_x (CommonMark with many pandoc extensions)
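To compare how these variants render the same page, a small loop can write one output file per variant. This is a sketch under my own assumptions: the function name convert_variants and the index.<variant>.md naming scheme are hypothetical, and only four of the variants are tried.

```shell
# Hypothetical helper: convert one HTML file into several Markdown
# variants so the results can be compared side by side.
convert_variants() {
  input=$1
  base=${input%.html}   # e.g. "index" for "index.html"
  for variant in markdown_strict commonmark gfm commonmark_x; do
    # Writes e.g. index.gfm.md next to index.html
    pandoc "$input" -f html -t "$variant" -o "$base.$variant.md"
  done
}
```

Calling convert_variants index.html should leave files such as index.gfm.md and index.commonmark.md next to the source, which you can then diff against the default output.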

Beyond pandoc

You can also try a dedicated Python package for converting HTML to Markdown: markdownify (available on PyPI). It has a command-line interface and supports many conversion options.

See also:

Convert Markdown to PDF