How I'm coding a Python project for importing my articles from Ghost to Hashnode

The technical part.

What's the aim?

As I wrote in my previous post, I want to import some articles from Ghost to Hashnode. From my research online, there is no direct way of doing this. I can, however, use Hashnode's bulk markdown importer, but I first need to convert the Ghost export file from JSON to Markdown.

This is the git repository:

https://github.com/abdallahYashir/GhostToHashNode

Choices of technology

As I discussed previously, I chose Python mostly because it's easy to install on Windows and the free PyCharm Community edition is great.

How do I analyse the task?

The first step is to begin with the end in mind. I want a list of markdown documents with a certain structure, packaged in a zip file. The required fields are:

  1. title
  2. date
  3. slug
  4. image (featured image)
  5. content

I got this from the sample document that you can download from the bulk markdown import menu.

--- 
title: "Why I use Hashnode" 
date: "2020-02-20T22:37:25.509Z" 
slug: "why-i-use-hashnode" 
image: "Insert Image URL Here" 
--- 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris hendrerit quam suscipit lectus mattis, sit amet vestibulum nisl porttitor. Ut sit amet blandit quam, eu placerat purus. Cras at convallis felis, non pellentesque lacus. In dolor libero, placerat ac blandit vel, pharetra nec libero. Sed vel arcu eget dolor accumsan pretium. Pellentesque quis odio euismod, ultricies urna ac, vulputate orci. Sed nec posuere ipsum. Fusce tincidunt arcu congue, consequat erat sed, finibus massa. In cursus lectus orci, ac vehicula neque interdum ut. Donec facilisis gravida leo. Nam nisl nunc, imperdiet at fringilla et, malesuada a eros. Aenean ut ullamcorper sapien, id pulvinar nibh. Phasellus est ipsum, ullamcorper quis fringilla vel, rutrum sed lacus. Integer a tincidunt purus. Morbi ornare non purus vel fringilla.
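A front matter block like the one above can be assembled with a simple helper. This is only a sketch; `build_front_matter` is a name I'm using for illustration, not a function from the project:

```python
def build_front_matter(title, date, slug, image):
    # Quote each value so a colon in the title doesn't break the front matter
    return (
        "---\n"
        f'title: "{title}"\n'
        f'date: "{date}"\n'
        f'slug: "{slug}"\n'
        f'image: "{image}"\n'
        "---\n"
    )
```

The blog post content is then appended after the closing `---` delimiter.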

Even if there are lots of unknowns at the start of such an endeavour, as a Software Developer for almost a decade, I have plenty of experience in Electronic Data Interchange (EDI). The process is to transform a set of data from one system to another. There are basically three steps:

  1. Export data from the source
  2. Clean and transform
  3. Import into the target

Inspecting the data

I first try to load the JSON file, which holds more than 170 posts, in Sublime Text and Visual Studio with JSON plugins. I want to glance at the structure and work out how to get all the required fields. However, they are too slow. I then switch to Excel, which does the job well; for me, it's one of the best tools on the market for manipulating data.

Where are the posts situated?

Db > 0 > data > posts

The first thing I note is that the earliest posts are mostly in draft status. This happened because I jotted down a few notes but preferred writing about something else; at the start of blogging on this platform, I was not sure which subject to focus on. So I also plan to filter by published status. I notice that all the fields I need are in a single post object, which is good for me as I don't need to manually stitch data together using foreign keys. I remember reading that Ghost uses a MySQL database. Which data fields do I need from the JSON file?

  1. Title
  2. Slug
  3. Html (to convert to markdown)
  4. Plaintext
  5. Feature_image
  6. Published_at

Here is an example:
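A representative post object, shown here as a Python dictionary: the field names match the Ghost export, but the values are invented for illustration.

```python
# Illustrative only: field names from the Ghost export, invented values
example_post = {
    "id": "abc123",
    "title": "Why I use Hashnode",
    "slug": "why-i-use-hashnode",
    "html": "<p>Lorem ipsum dolor sit amet...</p>",
    "plaintext": "Lorem ipsum dolor sit amet...",
    "status": "published",
    "feature_image": "https://example.com/cover.png",
    "published_at": "2020-02-20T22:37:25.509Z",
}
```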

Another interesting point is the html versus plaintext fields. What's the difference? Is plaintext enough? I tested this hypothesis by copying the content of my first post into the sample markdown document and then uploading it.

The images are not displayed, and the plaintext version does not have the proper formatting. I then convert the HTML version to markdown using a free online tool. The formatting looks better, but loading the images needs additional work. At this point, though, my aim is to write a script to upload the blog posts; if I have more time, I can improve it to get the images to display.
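The HTML-to-markdown step can also be done in code. A real project would likely reach for a library such as html2text or markdownify; the toy sketch below, built on the standard-library HTMLParser, only handles a few common tags to show the idea:

```python
from html.parser import HTMLParser

class SimpleMarkdownConverter(HTMLParser):
    """Toy converter covering a handful of tags, for illustration only."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._href = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self._href = attrs.get("href", "")
            self.out.append("[")
        elif tag == "img":
            # Images become markdown image syntax immediately
            self.out.append(f"![{attrs.get('alt', '')}]({attrs.get('src', '')})")

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.out.append(f"]({self._href})")
        elif tag == "p":
            self.out.append("\n\n")  # blank line between paragraphs

    def handle_data(self, data):
        self.out.append(data)

def html_to_markdown(html):
    converter = SimpleMarkdownConverter()
    converter.feed(html)
    return "".join(converter.out).strip()
```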

Project Structure

I've not followed any particular structure: just a run.py file and a process folder to hold the files for the steps required to clean and transform the data. I've also added a README file, which I plan to update later with instructions on how to install and use the project.

The .idea folder is created by PyCharm. I'm also trying to follow a TDD approach, but as I don't know the domain well, it's hit and miss so far. Additionally, I'm not well versed in Python unit testing libraries. The tests folder contains the tests and the test_data, files that represent a particular state.

I'm trying to use object-oriented programming for this small project. I can also use modules and/or Python packages.

Starting Point

The run.py file is the starting point. At this time (05/01/2022 10:38), this is the current work in progress; I'm still struggling to build a markdown file that is valid for Hashnode. The flow is simple:

  • import the JSON file
  • transform them
  • get the last 10 published posts
  • transform them from a dictionary to an object (this might not be needed)
  • generate the file with the fields needed for markdown front matter and the blog post content
  • finally zip the files
import zipfile
from pprint import pprint
from process.importing import Importing
from process.transform import Transform

# Load the Ghost export and pull out the posts
file = Importing("../abdallah-yashir-blog.ghost.2021-12-26.json")
transform_file = Transform(file.data)
transform_file.get_list_of_posts()

# Keep the last 10 published posts
my_posts = transform_file.filter_posts(10, 'published')
print(len(my_posts))
# pprint(my_posts[0])

# Turn the raw dictionaries into Ghost objects (this might not be needed)
ghost_posts = Transform.dict_to_object(my_posts)
pprint(ghost_posts[0].title)
single_post = ghost_posts[0]

# Convert one post to a markdown document with front matter
sample_file = transform_file.generate_file(single_post.title, single_post.published_at,
                                           single_post.slug, single_post.feature_image,
                                           single_post.html)
pprint(sample_file.strip())

with open("output.md", "w") as text_file:
    text_file.write(sample_file)

# Package the sample file into a zip
with zipfile.ZipFile('posts.zip', mode='w') as archive:
    archive.write("output.md")

Upcoming:

  • Fix the generated files so they are valid markdown
  • Loop over all the posts and add them to the zip
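The looping step could be sketched like this, assuming each post has already been rendered to a `(slug, markdown_text)` pair; `zip_posts` is a hypothetical helper, not code from the repository:

```python
import zipfile

def zip_posts(posts, archive_path="posts.zip"):
    # posts: an iterable of (slug, markdown_text) pairs (assumed shape)
    with zipfile.ZipFile(archive_path, mode="w") as archive:
        for slug, markdown in posts:
            # writestr avoids creating intermediate .md files on disk
            archive.writestr(f"{slug}.md", markdown)
```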

Another way to run the files is through the tests. However, at this point, they are not complete.

Objects / Models

I create a class to represent the fields, though I don't know exactly how to deserialize the JSON post into this class.

class Ghost:
    def __init__(self, id, title, slug, html, plaintext, status, visibility, feature_image, published_at, custom_excerpt):
        self.id = id
        self.title = title
        self.slug = slug
        self.html = html
        self.plaintext = plaintext
        self.status = status
        self.visibility = visibility
        self.feature_image = feature_image
        self.published_at = published_at
        self.custom_excerpt = custom_excerpt

    def convert_html_markdown(self):
        # Stub: the HTML-to-markdown conversion is not implemented yet
        self.html = ""
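One common way to do that deserialization is to keep only the dictionary keys that match the class's fields. This sketch uses a dataclass (`GhostPost` and `post_from_dict` are illustrative names, not the project's actual code):

```python
from dataclasses import dataclass, fields

@dataclass
class GhostPost:
    id: str = ""
    title: str = ""
    slug: str = ""
    html: str = ""
    plaintext: str = ""
    status: str = ""
    visibility: str = ""
    feature_image: str = ""
    published_at: str = ""
    custom_excerpt: str = ""

def post_from_dict(raw):
    # Keep only keys that match a field name; the Ghost export
    # has many more columns than we need
    known = {f.name for f in fields(GhostPost)}
    return GhostPost(**{k: v for k, v in raw.items() if k in known})
```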

This is for Hashnode. I'm not using it for the moment though.

class Hashnode:
    def __init__(self, title, date, slug, image):
        self.title = title
        self.date = date
        self.slug = slug
        self.image = image

Importing Class

The gist of importing the file is to first check that the file has a valid mime type, which is JSON, then open the file and load the JSON content into a dictionary. If there is no data, raise an exception saying the file is empty.

def read_file(self):
    check_valid_file_format(self.path)
    with open(self.path) as file_content:
        self.data = json.load(file_content)
    if self.data is None:
        raise Exception("Empty JSON File")
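The `check_valid_file_format` helper isn't shown in the post; a minimal sketch of what it might do, using the standard-library mimetypes module, could be:

```python
import mimetypes

def check_valid_file_format(path):
    # Hypothetical implementation: guess the mime type from the
    # file extension and reject anything that isn't JSON
    mime, _ = mimetypes.guess_type(path)
    if mime != "application/json":
        raise ValueError(f"Expected a JSON file, got: {mime}")
```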

I also forgot that import is a reserved word in Python, so I renamed the class to Importing.

To improve: use os.path.dirname() and os.path.join() so that the project works on any supported operating system.
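That improvement might look something like this (`export_path` is an illustrative helper; in run.py the base directory would typically come from `os.path.dirname(__file__)`):

```python
import os

def export_path(base_dir):
    # os.path.join uses the correct path separator for the current OS,
    # so the same code works on Windows, macOS and Linux
    return os.path.join(base_dir, "abdallah-yashir-blog.ghost.2021-12-26.json")
```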

Transform Class

The Importing class is more about loading and cleaning the data; now I need to transform it so that I can use it. As the export is deserialized into a dictionary, it's easy to get to the posts.

def get_list_of_posts(self):
    # The posts live at db[0]["data"]["posts"] in the export
    db = self.data["db"]
    self.posts = db[0]["data"]["posts"]
    return self.posts

I can also filter the posts by status and keep only the last N of them.

def filter_posts(self, number, status):
    # Keep posts matching the requested status, then take the last `number`
    filtered_posts = [post for post in self.posts if post['status'] == status]
    return filtered_posts[-number:]
