Brendon Smith: Projects

Overview

The externalized mind

What one ought to learn is how to extract patterns! You don't bother to memorize the literature-you learn to read and keep a shelf of books. You don't memorize log and sine tables; you buy a slide rule or learn to punch a public computer! ... You don't have to know everything. You simply need to know where to find it when necessary.¹

In the prescient 1968 science fiction novel Stand on Zanzibar, author John Brunner envisioned the coming Information Age. Brunner realized that the keys to success would be externalizing and finding information.

There is simply too much information for us to remember in biological memory (our brains), so we write it down and externalize it. In the past, paper books were the medium of choice, and today, we delegate the storage of information to computing machines. We use computer memory to store information, and operating systems to interact with memory.

We also need search engines to find information stored in computer memory. Search engines have become particularly important on the World Wide Web. The early days of the World Wide Web featured search engines with page results manually curated and organized by humans, but it became impossible for humans to organize that much information. Humans wrote algorithms (computer instructions), such as Google's PageRank, to organize information more efficiently, and search engines to locate the organized information for users. Google's stated mission was to "organize the world's information and make it universally accessible and useful." We all use search engines on the World Wide Web now.

It's not always efficient or practical to search the entire World Wide Web for every single query. Instead of relying only on web search, it helps to store smaller amounts of information in notes. Notes apps enable users to collect and connect their notes, creating a personal information management system or "externalized mind." Information is retained and rapidly searched, and disparate information sources can be connected in new ways. Clipping articles into a notes app removes ads and formatting, stores the articles to read later, and helps readers assess sources over time. This is becoming more and more important, because we're presented with feeds that are manipulated by BUMMER algorithms.

The Evernote entrypoint

Back in 2011, Evernote helped me digitize the information in my life and start externalizing my mind. By 2019, after eight years of extensive use, I had 5500 notes that formed a valuable personal knowledge base. I had also experienced many bugs and frustrations over those eight years, and wanted to switch to a new notes app.

I wanted to ditch the Evernote XML format (sort of HTML-in-XML) and switch to a new notes app that supported Markdown, a plain-text format that converts to HTML, the language used to structure web pages. There are many notes apps that support Markdown.

Another key feature I was looking for was note links, also sometimes called "backlinks," "internal links," "Wiki Links" or "WikiLinks." Note links lead from one note to another within the app. Over time, note links create a mind map of the connections among your notes, and enable quick navigation within the app. I therefore wanted to preserve the many note links I had created over the years.

In plain text, Evernote note links look like this:

<a
  style="color: rgb(105, 170, 53);"
  href="evernote:///view/6168869/s55/ef6f76d8-5804-486c-9259-43e80a1b0ff9/ef6f76d8-5804-486c-9259-43e80a1b0ff9/"
  >Title</a
>

All the notes apps I tried shared the same problem: Evernote note links weren't converted on import.

The Joplin jilt

I tried an app called Joplin. When importing my notes, note links were not converted from Evernote format to Joplin format. This was a deal-breaker for me. I asked about this on GitHub. The developer didn't even know about note links, and closed the issue because "Evernote doesn't export the note IDs."

The Evernote Developer API article on note links explains that relative links contain the note GUID. However, the .enex export doesn't attach the note GUID to the notes, so it's only possible to connect the GUID with the correct note within Evernote itself.
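
As a minimal sketch of what "contains the note GUID" means in practice, the GUID can be pulled out of a note link with a short regular expression. The helper name `extract_note_guid` is hypothetical, and the sketch assumes the GUID is the UUID-shaped segment seen in the example link above.

```python
import re
from typing import Optional

# Pattern for a UUID-shaped GUID (8-4-4-4-12 hexadecimal characters).
GUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def extract_note_guid(href: str) -> Optional[str]:
    """Return the first GUID found in an Evernote note link, if any."""
    if not href.startswith("evernote:///"):
        return None
    match = GUID_PATTERN.search(href)
    return match.group(0) if match else None


href = (
    "evernote:///view/6168869/s55/"
    "ef6f76d8-5804-486c-9259-43e80a1b0ff9/"
    "ef6f76d8-5804-486c-9259-43e80a1b0ff9/"
)
guid = extract_note_guid(href)
```

Knowing the GUID doesn't help much on its own, because the export gives no way to look up which note it belongs to.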

The Bear buoyance

One of the best apps I came across was Bear. I was happy with the features and user experience overall, but it still failed to convert Evernote note links into Bear note links. I reached out to Bear support about this. Support responded by saying Bear could not convert Evernote note links "due to different workflow and separate API that Bear and Evernote use." Pretty lame, but I wasn't deterred by this response.

After working with Bear a bit more, I was buoyed when I realized it has a key feature: note links, created by placing note titles within double brackets, like [[Title]]. Much as the Evernote source code matches up note GUIDs with the correct notes, the Bear source code locates the note using the title within double brackets, and links the note appropriately. This means that changing note titles could break the links. As of Bear 1.7, this isn't a problem, because Bear automatically updates note links when note titles change.

I decided to write a script that would convert note links from Evernote format to Bear format. I chose Python for this project, but for a comparable JavaScript/TypeScript implementation, see notable/dumper.

Scripting

Reading and writing files

I used the os module for working with file paths. File paths can be verified with os.path.exists(). Directories can be created with os.mkdir().
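
A minimal sketch of those two calls, run inside a temporary directory so nothing is left behind. The `bear` subdirectory name mirrors the output folder used later in this post; the variable names are only for illustration.

```python
import os
import tempfile

# Work inside a temporary directory so the sketch cleans up after itself.
with tempfile.TemporaryDirectory() as export_dir:
    bear_dir = os.path.join(export_dir, "bear")
    existed_before = os.path.exists(bear_dir)
    if not existed_before:
        # os.mkdir() raises FileExistsError if the directory already exists,
        # so check first.
        os.mkdir(bear_dir)
    exists_after = os.path.isdir(bear_dir)
```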

I used with statements (context managers) to work with file contents. The with statement was introduced in PEP 343. When a file is opened in a with block, it is closed automatically when the block exits, even if an error occurs. To save output to a new file, I opened it in "x" (exclusive creation) mode, which creates the file and raises FileExistsError if it already exists, so existing files are never overwritten. I also looked into fileinput, but decided against it because it didn't add much value beyond a simple with block.
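
Here is a small sketch of exclusive creation mode in a with block. The file name and contents are placeholders, not taken from a real export.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as export_dir:
    new_path = os.path.join(export_dir, "export.enex")  # placeholder file name

    # "x" creates the file, and raises FileExistsError if it already exists.
    with open(new_path, "x") as new_enex:
        new_enex.write("<en-export></en-export>")
    closed_after_block = new_enex.closed  # the with block closed the file

    # A second attempt with "x" refuses to overwrite the existing file.
    try:
        with open(new_path, "x") as duplicate:
            duplicate.write("this is never written")
        second_write_blocked = False
    except FileExistsError:
        second_write_blocked = True

    with open(new_path) as enex:
        contents = enex.read()
```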

Parsing Evernote exports

I decided to try out Beautiful Soup. Beautiful Soup is usually used to parse webpages, but Evernote notes are somewhat like webpages, so it could work.

Selecting a parser was not straightforward. Evernote XML is a blend of XML and HTML, and many parsers depend on a valid HTML document.

I first tried xml, the XML parser in lxml. I couldn't match links, and HTML elements were being read like &lt;div&gt; instead of <div>.
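
One likely explanation, sketched here with the standard library's `xml.etree.ElementTree` rather than lxml: the note body sits inside a CDATA section, so an XML parser treats it as plain text instead of elements. The sample document is a heavily simplified stand-in for an .enex export, not a real one.

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for an .enex export (assumption, not a real export).
enex = """<en-export>
  <note>
    <title>Example</title>
    <content><![CDATA[<en-note><div>Hello</div></en-note>]]></content>
  </note>
</en-export>"""

root = ET.fromstring(enex)
content = root.find("note/content")

# To the XML parser, the CDATA payload is text, not child elements.
body_text = content.text
child_count = len(content)

# Serializing escapes that text, so <div> comes back out as &lt;div&gt;.
serialized = ET.tostring(content, encoding="unicode")
```

Because the markup is only text to the XML parser, there are no anchor elements for a link filter to find.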

I had some success with the HTML parser in lxml. It was lenient enough to read the XML parts, and parse the document into valid HTML. The lxml parser made it easy to match Evernote note links:

I started by learning about the different kinds of filters.
I used an href attribute filter and a regular expression filter to identify Evernote note links. This method avoided replacing URLs that were not Evernote note links. The function looked like this:
```
def note_link(href):
    """Identify note link URIs using href attribute and regex"""
    return href and re.compile(r"evernote://").search(href)
```
I then passed the note_link() function into soup.find_all() to locate all note links, extracted the strings (note titles) from the links using the .string attribute, and replaced the links with the titles within double brackets.
```
for link in soup.find_all(href=note_link):
    string = link.string.extract()
    bear_link = link.replace_with(f"[[{string}]]")
```

I could then print the output and see the note links properly replaced within the notes. The full function looked like this:

def convert_links(file):
    """Convert links in .enex files to Bear note link format.
    """
    with open(file) as enex:

        def note_link(href):
            """Identify note link URIs using href attribute and regex"""
            return href and re.compile(r"evernote://").search(href)

        soup = BeautifulSoup(enex, "lxml")
        for link in soup.find_all(href=note_link):
            string = link.string.extract()
            bear_link = link.replace_with(f"[[{string}]]")
        with open(f"{os.path.dirname(file)}/bear/{file.name}", "x") as new_enex:
            new_enex.write(str(soup))

The problem was that, in order to parse the document, lxml was adding <html> and <body> tags and removing the CDATA sections. Trying to import files modified in this way crashed Bear. I had similar problems with html5lib. The html5lib parser also forces some formatting by adding comments around the XML portions.

I then went back to the default Python html.parser. The html.parser doesn't modify formatting, but because the Evernote export is not valid HTML, it's not able to parse the tags like lxml does.

The note body was located within the content tag. I considered selectively parsing the content tag with SoupStrainer.

At this point, I realized I could rely on re to match and replace regular expressions in the entire document, rather than trying to get Beautiful Soup to identify specific links.

Matching patterns

Regular expressions and `re`

The Python re module provides helpful features for regular expression ("regex") operations. Patterns are usually written with Python's raw string notation (r""), so that backslashes in regular expressions don't have to be escaped a second time for the Python string literal.

Initially, I matched Evernote note links with re.compile(r'(<a href="evernote.*?>)(.*?)(</a>?)') and replaced them with re.sub(r'(<a href="evernote.*?>)(.*?)(</a>?)', r"[[\2]]", soup). The \2 was a backreference to the second capture group (the string inside the note link), which was the note title. The question mark was particularly important in the second capture group, (.*?). Without the question mark, Python will continue through the remainder of the string to the last occurrence of the third capture group (</a>?), rather than stopping at the first occurrence.
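
The effect of that question mark is easy to see on a string with two links. This is a small sketch with made-up link targets; the only difference between the two patterns is `(.*?)` versus `(.*)` in the second capture group.

```python
import re

html = (
    '<a href="evernote:///view/1/s1/a/a/">First</a> and '
    '<a href="evernote:///view/1/s1/b/b/">Second</a>'
)

lazy = r'(<a href="evernote.*?>)(.*?)(</a>?)'
greedy = r'(<a href="evernote.*?>)(.*)(</a>?)'

# Lazy: each link is matched and replaced separately.
lazy_result = re.sub(lazy, r"[[\2]]", html)

# Greedy: the second group runs on to the last closing tag,
# so both links collapse into a single replacement.
greedy_result = re.sub(greedy, r"[[\2]]", html)
```

With the lazy pattern, the result is `[[First]] and [[Second]]`. With the greedy pattern, everything between the first opening tag and the last closing tag ends up inside one pair of double brackets.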

I used a second regular expression to strip H1 tags out of the notes. Many of my clipped news articles included the article title within <h1> HTML tags. In Bear, the note title itself serves as H1, so the additional H1 within the note was unnecessary. I could have completely deleted these H1 elements, but I decided to retain just the text in case I had made changes from the original article titles. Again here, I used a backreference to the second capture group to retain the text within the H1 tags. To run both the re.sub() operations, I just overwrote the object created from the first re.sub().

soup_sub = re.sub(r'(<a href="evernote.*?>)(.*?)(</a>?)', r"[[\2]]", soup)
soup_sub = re.sub(r"(<h1>?)(.*?)(</h1>?)", r"\2", soup_sub)

I then wrote the soup_sub object to a new file using a with context again.

with open(file) as enex:
    soup = str(BeautifulSoup(enex, "html.parser"))
    soup_sub = re.sub(r'(<a href="evernote.*?>)(.*?)(</a>?)', r"[[\2]]", soup)
    soup_sub = re.sub(r"(<h1>?)(.*?)(</h1>?)", r"\2", soup_sub)
    with open(f"{os.path.dirname(file)}/bear/{file.name}", "x") as new_enex:
        new_enex.write(soup_sub)

Success!
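
To check the two substitutions without a full export, they can be run on a short sample string. The sample note below is invented for illustration.

```python
import re

note = (
    "<h1>Article title</h1>"
    '<div>See <a href="evernote:///view/1/s1/a/a/">Related note</a>.</div>'
)

# First pass: Evernote note links become Bear note links.
converted = re.sub(r'(<a href="evernote.*?>)(.*?)(</a>?)', r"[[\2]]", note)
# Second pass: H1 tags are removed, keeping the text inside them.
converted = re.sub(r"(<h1>?)(.*?)(</h1>?)", r"\2", converted)
```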

Updating regular expressions to catch all note links

The initial implementation converted most of the links in my Evernote exports, but missed some. Further inspection showed that Evernote was inserting additional attributes between the opening anchor tag a and the href, like:

<a style="font-weight: bold;" href="evernote:///">Title</a>

I updated the pattern matching behavior to accommodate additional attributes in the anchor tag. The regex only needed a minor update, from (<a href="evernote.*?>)(.*?)(</a>?) to (<a.*?href="evernote.*?>)(.*?)(</a>?).
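
A quick sketch comparing the two patterns on a styled link. It also includes a caveat that may be worth checking against real exports: because `.*?` can cross tag boundaries, a web link earlier on the same line can be swallowed along with the note link. The `strict` pattern is a hypothetical alternative that keeps the match inside one tag, not the pattern used in the script.

```python
import re

original = r'(<a href="evernote.*?>)(.*?)(</a>?)'
updated = r'(<a.*?href="evernote.*?>)(.*?)(</a>?)'
# Hypothetical stricter variant: [^>] cannot run past the end of a tag.
strict = r'(<a[^>]*?href="evernote[^>]*>)(.*?)(</a>?)'

styled = '<a style="font-weight: bold;" href="evernote:///view/1/s1/b/b/">Styled</a>'
original_result = re.sub(original, r"[[\2]]", styled)  # link is missed
updated_result = re.sub(updated, r"[[\2]]", styled)  # link is converted

# A web link followed by a note link on the same line.
mixed = (
    '<a href="https://example.com">Web</a> '
    '<a href="evernote:///view/1/s1/b/b/">Note</a>'
)
updated_mixed = re.sub(updated, r"[[\2]]", mixed)  # web link is lost
strict_mixed = re.sub(strict, r"[[\2]]", mixed)  # web link is kept
```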

No soup for you

After implementing regular expressions, I realized that I actually didn't need Beautiful Soup at all. I could just read the file as a string, modify the string, and then write the string to a new file. Simple! I made a few touch-ups, such as using pathlib instead of os for file path operations, but at this point, the code was ready to run.

Updating regular expressions for wiki links

Bear has made lots of progress since I first imported my notes. Note links have been updated with some new "wiki link" features:

Forward slashes reference headings within notes ([[note title/heading]])
Pipes configure aliases (different link titles) ([[note title|alias]])

If any notes have forward slashes or pipes in the titles, links to those notes need to escape (ignore) forward slashes and pipes to avoid conflicting with how they are used in Bear wiki links. Bear uses backslashes to escape characters in note links, so backslashes themselves also need to be escaped.
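
The order of the escapes matters: backslashes have to be escaped first, or the backslashes added for slashes and pipes get escaped again. A minimal sketch, with hypothetical helper names and a made-up note title:

```python
def escape_title(title: str) -> str:
    """Escape backslashes first, then forward slashes and pipes."""
    escaped = title.replace("\\", r"\\")
    escaped = escaped.replace("/", r"\/")
    return escaped.replace("|", r"\|")


def escape_title_wrong_order(title: str) -> str:
    """Escape backslashes last, which re-escapes the earlier output."""
    escaped = title.replace("/", r"\/")
    escaped = escaped.replace("|", r"\|")
    return escaped.replace("\\", r"\\")


title = "TCP/IP | notes"
right = escape_title(title)  # TCP\/IP \| notes
wrong = escape_title_wrong_order(title)  # TCP\\/IP \\| notes
```

In the wrong order, each slash and pipe ends up preceded by an escaped backslash, which leaves the slash and pipe themselves unescaped.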

Previously, the script was simply formatting links by placing the note title (\2, the second capture group in the regular expression) inside double brackets, like [[note title]].

enex_contents_with_converted_links = re.sub(
    r'(<a.*?href="evernote.*?>)(.*?)(</a>?)', r"[[\2]]", enex_contents
)

The second argument to re.sub can also accept a "callable" (function or other object with a __call__ method). One way to pass a callable to re.sub is to use a lambda expression:

enex_contents_with_converted_links = re.sub(
    r'(<a.*?href="evernote.*?>)(.*?)(</a>?)',
    lambda match: "[[" + match.group(2).replace("\\", r"\\").replace("/", r"\/").replace("|", r"\|") + "]]",
    enex_contents,
)

The lambda expression might be considered difficult to read. Let's move the callable to a separate function definition, with a leading underscore to indicate that the function is private (only for use within this script). As explained in the re docs, "If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single Match argument, and returns the replacement string." The function should therefore identify the second capture group within the Match argument, escape special characters, and return the formatted wiki link, like this:

def _format_note_link(match: re.Match) -> str:
    note_title = match.group(2)
    escaped_note_title = note_title.replace("\\", r"\\")
    escaped_note_title = escaped_note_title.replace("/", r"\/")
    escaped_note_title = escaped_note_title.replace("|", r"\|")
    return f"[[{escaped_note_title}]]"

We'll then reference the function as the second argument to re.sub.

enex_contents_with_converted_links = re.sub(
    r'(<a.*?href="evernote.*?>)(.*?)(</a>?)', _format_note_link, enex_contents
)

Note links will now be properly escaped.

Scripted

To download from GitHub:

git clone git@github.com:br3ndonland/el2bl.git
cd el2bl
python el2bl.py

Please input the path to a directory with Evernote exports: tests
Converting export.enex...
Converted export.enex. New file available at tests/bear/export.enex.

Here's the final script.

#!/usr/bin/env python3
"""el2bl: convert Evernote note links to Bear note links"""

import pathlib
import re


def _format_note_link(match: re.Match) -> str:
    """Format a note link for Bear.
    ---
    - Identify note title, assuming:
        - Title is in second capture group
        - Title in note link matches actual note title
        - Title is not already escaped
    - Format Bear wiki links (`[[note title]]`), escaping special characters:
        - Backslashes are escape characters (backslashes themselves should be escaped)
        - Forward slashes reference headings within notes (`[[note title/heading]]`)
        - Pipes configure aliases (different link titles) (`[[note title|alias]]`)

    https://docs.python.org/3/library/re.html
    https://bear.app/faq/how-to-link-notes-together/
    """
    note_title = match.group(2)
    escaped_note_title = note_title.replace("\\", r"\\")
    escaped_note_title = escaped_note_title.replace("/", r"\/")
    escaped_note_title = escaped_note_title.replace("|", r"\|")
    return f"[[{escaped_note_title}]]"


def input_enex_path() -> None:
    """Read .enex files in directory.
    ---
    - Accept path to directory from user input
    - Verify that directory is valid
    - Iterate over directory and convert links in each file
    """
    input_path = input("Please input the path to a directory with Evernote exports: ")
    path = pathlib.Path(input_path)
    try:
        if not path.is_dir():
            raise NotADirectoryError(path)
        for file in path.iterdir():
            if file.is_file() and file.suffix == ".enex":
                convert_links(file)
    except Exception as e:
        print(f"\n{e.__class__.__qualname__}: {e}")


def convert_links(enex_path: pathlib.Path) -> pathlib.Path:
    """Convert links in .enex files to Bear note link format.
    ---
    - Read contents of file
    - Replace Evernote note link URIs, but not other URIs, with Bear note links
    - Remove H1 tags from note body
    - Write to a new file in the bear subdirectory
    """
    print(f"Converting {enex_path.name}...")
    enex_contents = enex_path.read_text()
    enex_contents_with_converted_links = re.sub(
        r'(<a.*?href="evernote.*?>)(.*?)(</a>?)', _format_note_link, enex_contents
    )
    enex_contents_with_converted_links = re.sub(
        r"(<h1.*?>)(.*?)(</h1>?)", r"\2", enex_contents_with_converted_links
    )
    new_enex_dir = enex_path.parent / "bear"
    new_enex_dir.mkdir(exist_ok=True)
    new_enex_path = new_enex_dir / enex_path.name
    new_enex_path.write_text(enex_contents_with_converted_links)
    print(f"Converted {enex_path.name}. New file available at {new_enex_path}.")
    return new_enex_path


if __name__ == "__main__":
    input_enex_path()

Footnotes

¹ From Stand on Zanzibar by John Brunner, Continuity 3