Extracting Subsidiaries from SEC Filings using Jina AI and Llama 3.1

Aug 1, 2024

My curiosity often leads me down odd paths, and recently I tasked myself with figuring out the subsidiaries of certain companies.

I learned about the SEC 10-K forms that US companies have to submit annually and the included Exhibit 21 - a mandatory disclosure of subsidiaries.

This HTML document can be found in the SEC’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system and seems to have no consistent format, so writing a manual parser would’ve been a PITA.

Luckily, I’d recently found Jina AI’s ‘Reader’ API and found it to be perfect for the task at hand.

This quickpost aims to demonstrate the rocket propellant with which developers have been equipped - especially for data extraction - during this new AI era.

The Challenge

These Exhibit 21 documents are typically in HTML format, but they vary widely in their structure and formatting. This inconsistency makes automated extraction difficult using traditional web scraping techniques. Here’s an example of what these documents can look like:

One Example Exhibit 21

Another Example Exhibit 21

The Solution

To tackle this challenge, I hacked a little solution that combines two technologies:

Jina AI’s Reader API: This tool converts HTML and other file formats into clean, readable text for language models. It’s particularly effective at stripping away unnecessary HTML tags and formatting, leaving just the content we need. Any young computer scientist’s dream.
Ollama (locally hosted Llama 3.1): Llama 3.1 is the latest and greatest in (kinda) open source models and I use it with some prompting to do the final text extraction and cleaning. It’s flexible enough to understand various ways companies might list their subsidiaries.
In the future I could improve this with something like @jxnl/instructor to really force structured output.
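
Short of a dedicated library, one lightweight way to enforce the output shape is to validate the parsed JSON before using it. The sketch below is not instructor and not part of the original script: `SubsidiaryReport` and `from_dict` are hypothetical names, and the two keys are assumed from the prompt used in this post.

```python
from dataclasses import dataclass


@dataclass
class SubsidiaryReport:
    company_name: str
    subsidiaries: list

    @classmethod
    def from_dict(cls, data):
        """Build a report from parsed model output, rejecting unexpected shapes."""
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        name = data.get("companyName")
        subs = data.get("subsidiaries")
        if not isinstance(name, str) or not name.strip():
            raise ValueError("companyName must be a non-empty string")
        if not isinstance(subs, list) or not all(isinstance(s, str) for s in subs):
            raise ValueError("subsidiaries must be a list of strings")
        # Trim whitespace and drop empty entries the model may have emitted
        cleaned = [s.strip() for s in subs if s.strip()]
        return cls(company_name=name.strip(), subsidiaries=cleaned)
```

A failed validation is a natural point to retry the model call with the same prompt.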

The Process

The extraction process follows these steps:

Input a URL of an Exhibit 21 document
Clean the HTML using Jina AI’s Reader API
Extract information using Ollama with a specific prompt
Output formatted JSON with the company name and subsidiaries

Here’s a simplified version of the core code:

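import json
import sys

import ollama
import requests


def process_url(url, api_key):
    # Jina AI's Reader returns a cleaned, LLM-friendly version of the page at `url`
    jina_url = f"https://r.jina.ai/{url}"
    headers = {'Authorization': f'Bearer {api_key}'}
    response = requests.get(jina_url, headers=headers, timeout=60)
    response.raise_for_status()  # fail early on HTTP errors

    prompt = f'''
    <ex21>{response.text}</ex21>
    Extract the subsidiaries and return only JSON in this shape:
    {{
        "companyName": "Full Company Name",
        "subsidiaries": ["Subsidiary 1", "Subsidiary 2"]
    }}
    '''

    # Llama 3.1 is served by Ollama under the 'llama3.1' tag
    ollama_response = ollama.chat(model='llama3.1', messages=[
        {'role': 'user', 'content': prompt}
    ])

    # json.loads raises if the model adds any text around the JSON
    result = json.loads(ollama_response['message']['content'])
    result['url'] = url  # keep the source URL alongside the extracted data
    return result


if __name__ == '__main__':
    # Usage: python extract_subsidiaries.py <exhibit-21-url> <jina-api-key>
    output = process_url(sys.argv[1], sys.argv[2])
    print(json.dumps(output, indent=4, ensure_ascii=False))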

Run like this:

python extract_subsidiaries.py https://www.sec.gov/Archives/edgar/data/1232524/000123252424000015/jazzq42023ex211.htm jina_<API-KEY>
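
One caveat: the `json.loads` call will raise if the model wraps its answer in extra prose or a Markdown fence, which chat models sometimes do. A minimal sketch of a more forgiving parser, assuming the reply contains a single JSON object (`extract_json` is a hypothetical helper, not part of the original script):

```python
import json


def extract_json(reply):
    """Parse the first JSON object found in a model reply."""
    start = reply.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model reply")
    # raw_decode parses one JSON value beginning at `start`
    # and ignores whatever text follows it
    obj, _ = json.JSONDecoder().raw_decode(reply, start)
    return obj
```

It could stand in for `json.loads` in `process_url`. Malformed JSON still raises `json.JSONDecodeError`, so the caller can decide whether to retry.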

Results

Let’s look at a couple of examples to see how this works in practice:

Example 1: Jazz Pharmaceuticals, Inc.

Input URL: https://www.sec.gov/Archives/edgar/data/1232524/000123252424000015/jazzq42023ex211.htm

Output:

{
    "companyName": "Jazz Pharmaceuticals, Inc.",
    "subsidiaries": [
        "Jazz Pharmaceuticals Ireland Limited",
        "Jazz Financing I DAC",
        "GW Pharma Limited",
        "Celator Pharmaceuticals Inc.",
        "Jazz Pharmaceuticals Research UK Limited",
        "Gentium S.r.l.",
        "Jazz Pharmaceuticals UK Holdings Limited",
        "Jazz Securities DAC",
        "Jazz Financing Holdings Limited",
        "Jazz Financing Lux S.à.r.l",
        "Jazz Pharmaceuticals International Limited",
        "Jazz Investments Europe Limited",
        "Jazz Capital Limited",
        "Jazz Pharmaceuticals UK Limited",
        "GW Pharmaceuticals Limited",
        "Cavion Inc"
    ],
    "url": "https://www.sec.gov/Archives/edgar/data/1232524/000123252424000015/jazzq42023ex211.htm"
}

Example 2: BioMarin Pharmaceutical Inc.

Input URL: https://www.sec.gov/Archives/edgar/data/1048477/000104847724000016/bmrn-2023xexx211.htm

Output:

{
    "companyName": "BioMarin Pharmaceutical Inc.",
    "subsidiaries": [
        "BioMarin Commercial Ltd",
        "BioMarin International Ltd"
    ],
    "url": "https://www.sec.gov/Archives/edgar/data/1048477/000104847724000016/bmrn-2023xexx211.htm"
}

Key Benefits

This approach offers several advantages:

Automation: It handles various HTML structures without manual intervention.
Scalability: While the examples show one URL at a time, it can be easily scaled to process multiple documents in parallel.
Adaptability: The AI can handle different ways companies might list their subsidiaries.
Flexibility: With minor adjustments to the prompt, it can extract other types of information from these documents.
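
As a rough sketch of the scalability point, a thread pool can fan a list of URLs out to the extraction function. `process_many` is a hypothetical helper, and the worker count is an arbitrary assumption. Actual throughput depends on how many concurrent requests your local Ollama instance and the Reader rate limits allow.

```python
from concurrent.futures import ThreadPoolExecutor


def process_many(urls, process_one, max_workers=4):
    """Run process_one(url) for each URL in a thread pool.

    Returns a dict mapping each URL to its result, or to the exception raised.
    """
    def safe(url):
        try:
            return process_one(url)
        except Exception as exc:  # keep one bad filing from aborting the batch
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input URLs
        return dict(zip(urls, pool.map(safe, urls)))
```

With the script above, this could be called as `process_many(urls, lambda u: process_url(u, api_key))`.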

Getting Started

If you want to try this yourself:

Sign up for a free Jina AI Reader API key. You’ll get 1,000,000 tokens to start with.
Install Ollama and download the llama3.1 model.
Use the code provided above as a starting point, making sure to handle API keys securely.
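
One simple way to keep the key out of your shell history is to read it from an environment variable rather than passing it on the command line. This is a sketch only: the variable name `JINA_API_KEY` and the helper `get_jina_api_key` are assumptions, not something the Reader API requires.

```python
import os


def get_jina_api_key(env=os.environ):
    """Read the Reader API key from the environment instead of the command line."""
    key = env.get("JINA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set the JINA_API_KEY environment variable first")
    return key
```

The result can then be passed as the `api_key` argument of `process_url`.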

Conclusion

I know this is no grandiose demonstration of my coding skills, but it really showcases the 10x tooling that this new AI era has ushered in for anyone willing to learn it. What would’ve taken me weeks to hand-code with XPath and BeautifulSoup took me an hour.

I’m glad I could demonstrate how AI tools can simplify complex data extraction tasks. By combining Jina AI’s Reader for HTML cleaning with a language model served by Ollama for interpretation, we can quickly extract structured data from otherwise challenging documents.

As always, when working with public data sources, be sure to comply with all relevant terms of service and usage guidelines. Luckily,