From the course: Level Up: Advanced Python

HTML to markdown converter - Python Tutorial

From the course: Level Up: Advanced Python

HTML to markdown converter

(energetic music) - [Instructor] In this challenge, we're going to convert text from HTML to Markdown. We're going to implement a small portion of the html2markdown Python package. So let's go ahead and update html2markdown.py. So you want to convert to italics, so that's HTML em tags to asterisks. We want to convert consecutive spaces or line breaks to a single space. We want to convert to paragraphs, so that's HTML p tags to two line breaks. And finally, we want to convert URLs from HTML links to Markdown. You can also test your file using pytest and the test_html2markdown file in the test directory. Let me give you a couple of examples of what HTML to Markdown looks like. So if I use em tags for italics, I get this as the output, and if I have consecutive spaces, this is what your code should return. And finally, when using paragraphs, this is the output you should get. So go ahead and use the html2markdown as your starting point. Pause the video here and I'll show you the solution I came up with. (energetic music) Now, one way to think of this is as a pipeline of processing operations to be performed on your HTML text. I've decided to go with regular expression values and using the re.compile objects. For italics, we're using capturing groups to pick up anything with em tags and then inserting an asterisk on either side. Regular expressions are greedy by default, so if there were multiple em tags, it would pick up the text between the first opening em tag and the last closing em tag, including any opening and closing em tags in between. We don't want this, and so adding a question mark makes the regular expression non-greedy and picks up each of the item tags separately. For spaces, we are taking one or more white spaces and replacing them with a single space. And when we deal with paragraphs, we do so in a similar way to italics. For the URLs, we need two caption groups, and this is to capture the URL and the link. And finally, for the URLs, we need two caption groups, the URL and the link, and we use a backslash one and a backslash two to capture the URL and the link. And finally, we remove any leading or trailing white space and return the text. Let's check this passes all our tests. So let me make a little bit of room, so clear, pytest, test and the path to the test file, and you can see that we've got a green bar that suggests that we've passed all of the tests. Now, my solution is just one way of solving this problem. Go ahead and share your solution in the Q&A section. Now, just so you know, if you just post your code in the Q&A section as plain text, it won't be formatted, and so it'll be difficult to read. You might want to post the link to your code snippet using something like GitHub's Gist instead. I'd love to see your answer to this code challenge.

Contents