Daniel Hensley, Co-Founder and CTO, Driver
Embedded systems are complex to understand and build. Embedded software is some of the most low-level and esoteric code in the software world.
In their work, embedded engineers have to navigate millions of lines of code across multiple languages and frameworks, with development histories that span decades.
Furthermore, embedded systems are defined by tight coupling between multiple hardware components (e.g., microcontroller, peripheral sensors, FPGAs) and software components (firmware/drivers and application code) that must be understood together.
Documentation in the past and documentation in the future
To make sense of embedded systems, stakeholders from developers to customers depend on documentation. The problem? Traditional documentation methods simply can't keep pace, and that holds everyone back.
Embedded monorepos and other components of embedded software stacks can include tens of millions of lines of code, multiple programming languages, complex interactions between components, decades of accumulated development, and continuous updates and changes.
If your team had to build new documentation for 30 million lines of code today, how would that work? With a team of three to five engineers, would it take months? Years? Would you not even try because it just doesn't make sense? And how would you keep it up to date, another challenge that plagues documentation today?
That is the story and the calculus of the past. With the advent of today’s LLM technology, we can do things differently.
In the 30-million-line example, manual, human-only methods would take months or years just to generate the text, let alone ensure its quality. LLMs can produce this output in minutes or hours and keep it up to date automatically.
This is the story and the calculus of the future. These order-of-magnitude changes let us transform how we work: we can produce documentation in a new way and rethink what we want it to be.
Challenges at scale
Taking advantage of this opportunity requires real work to ensure the consistency and quality of the output, so that it can be trusted and works when applied to codebases of the arbitrary sizes and shapes encountered in the wild.
Dealing with complexity and scale, regardless of the tools used, requires structure. In this context, we need to work with assets that were not used to train the foundation model and whose content changes rapidly as developers update the codebase.
Retrieval-augmented generation (RAG) shines in exactly this scenario: it semantically searches a knowledge base at query time and supplies the content relevant to a particular query. But naive RAG approaches fall down in the face of large scale and complexity.
RAG depends on identifying the relevant information and passing it into the LLM. If what gets sent is noise rather than signal, even the best LLMs cannot generate quality output; this is the critical information-theoretic bottleneck in RAG performance. Naive methods simply flatten a codebase into chunks for a flat semantic search, erasing critical structural information and hierarchies. That can work for small knowledge bases, but not at large scale, such as a codebase with tens of millions of lines of code.
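To make the flattening concrete, here is a deliberately naive sketch of the flat-chunk approach. Everything in it is illustrative: embed() is a hypothetical stand-in for whatever embedding model is used, and the chunk size is arbitrary. Nothing in this pipeline knows which driver, module, or architectural layer a chunk came from.

```python
# Deliberately naive flat-chunk RAG over a codebase. embed() is a
# hypothetical stub; the chunk size is an arbitrary illustrative value.
import math
from pathlib import Path

CHUNK_SIZE = 1200  # characters per chunk

def flatten_codebase(root: str) -> list[str]:
    """Flatten every C source file into fixed-size text chunks.

    Note what gets erased: file boundaries, directory hierarchy,
    include relationships, and symbol structure all disappear.
    """
    chunks: list[str] = []
    for path in Path(root).rglob("*.c"):
        text = path.read_text(errors="ignore")
        chunks.extend(text[i:i + CHUNK_SIZE]
                      for i in range(0, len(text), CHUNK_SIZE))
    return chunks

def embed(text: str) -> list[float]:
    """Hypothetical stub for an embedding model call."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank every chunk against the query in one flat, structure-blind pool."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

At small scale this works passably. At tens of millions of lines, every chunk competes in a single flat ranking, so structurally important context, such as which peripheral driver a function belongs to, rarely survives into the LLM's prompt.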
Building a new solution that works at scale
Drawing inspiration from signal processing and compiler design, Driver has developed what we call a source-code-to-human-language “transpiler.” We realized that by building structure on top of source code as we process it, we can overcome the deficiencies of naive RAG. We bring significant static analysis and computer science tooling together with LLMs. While LLMs are an important component, they are part of a much larger technology stack.
Conceptually, we view our methods as “pre-computing” important structure and information that ensures high-quality RAG downstream, regardless of the size and complexity of the software. This system combines three key components to enable code comprehension at scale (a minimal sketch of how they fit together follows the list):
- Directed Acyclic Graphs (DAGs): Every codebase, regardless of size or language, has an inherent file structure that can be represented as a DAG. This provides a universal starting point for analysis, guaranteed topological ordering for processing, and the ability to handle codebases of any size systematically.
- Intermediate Representations (IRs): Instead of just directly chunking source code, Driver generates multiple layers of derived explanations or intermediate representations. We design our IRs to optimize downstream outcomes — automatically generating technical documentation and powering effective RAG. Key concepts in our IR generation:
- Variable Abstraction Levels: At the lowest level, IRs describe individual symbols and lines and summarize whole files. We then build module-level IR descriptions and codebase-wide architectural views that are informed by aggregating lower-level IRs.
- Different Content Lengths: An individual IR can be a single-sentence summary, a paragraph-length explanation, a structured format (e.g., symbol documentation), or comprehensive multi-page documentation.
- Isolated Signals: We can split different kinds of information into separate IRs. For example, at the file level, we can separate dependencies and imports, data structures, and functions. These can be aggregated upward and distilled into higher-level IRs.
- Multi-Pass Processing: Like a modern compiler, Driver’s system makes multiple passes over the code, building understanding iteratively. This enables us to pre-compute important information, progressively refine documentation, integrate different types of analysis, and generate higher-level insights.
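As a concrete illustration, here is a minimal sketch of how these three components can fit together for a C codebase. Everything in it is simplified and hypothetical: llm_summarize() stands in for an LLM call, the include-scanning regex is a toy (real static analysis resolves include paths and handles cyclic includes, which this does not), and only two of the abstraction levels are shown.

```python
# Minimal sketch: file DAG + layered IRs + multi-pass processing.
import re
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from pathlib import Path

@dataclass
class IR:
    subject: str  # file or module (directory) path
    level: str    # "file" or "module" (higher levels omitted here)
    kind: str     # isolated signal, e.g., "summary" or "dependencies"
    text: str

def build_file_dag(root: str) -> dict[str, set[str]]:
    """Map each source file to the local headers it includes (its predecessors)."""
    include = re.compile(r'#include\s+"([^"]+)"')
    raw: dict[str, set[str]] = {}
    for path in Path(root).rglob("*.[ch]"):
        rel = str(path.relative_to(root))
        raw[rel] = set(include.findall(path.read_text(errors="ignore")))
    # Keep only edges to files we actually scanned, so the graph is closed.
    return {f: deps & raw.keys() for f, deps in raw.items()}

def llm_summarize(prompt: str) -> str:
    """Hypothetical stub for an LLM summarization call."""
    raise NotImplementedError

def transpile(root: str) -> list[IR]:
    graph = build_file_dag(root)
    irs: list[IR] = []

    # Pass 1: file-level IRs in topological order, so each file is
    # summarized after the headers it depends on, with their summaries
    # available as pre-computed context.
    file_summaries: dict[str, str] = {}
    for f in TopologicalSorter(graph).static_order():
        source = (Path(root) / f).read_text(errors="ignore")
        context = "\n".join(file_summaries[d] for d in graph[f])
        file_summaries[f] = llm_summarize(
            f"Dependency summaries:\n{context}\n\nSummarize this file:\n{source}")
        irs.append(IR(subject=f, level="file", kind="summary",
                      text=file_summaries[f]))

    # Pass 2: module-level IRs by aggregating file-level IRs per directory.
    by_dir: dict[str, list[str]] = {}
    for f, summary in file_summaries.items():
        by_dir.setdefault(str(Path(f).parent), []).append(summary)
    for directory, summaries in by_dir.items():
        irs.append(IR(subject=directory, level="module", kind="summary",
                      text=llm_summarize(
                          "Describe this module from its file summaries:\n"
                          + "\n".join(summaries))))
    return irs
```

Even at this level of simplification, the essential properties hold: the DAG supplies a guaranteed processing order, each pass pre-computes context for the next, and every IR carries explicit structure (subject, level, kind) instead of being an anonymous chunk.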
The transpiler sits at the core of our product. All incoming code assets are immediately processed by it. Tech Docs are automatically generated from the computed IRs, and our RAG pipeline uses the structured IRs to power dynamic, open-ended content generation in our Pages feature.
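As one hedged example of what retrieval over structured IRs might look like mechanically (this sketches the general idea, not Driver's production pipeline), a coarse-to-fine pass can exploit the IR levels directly. It reuses the IR dataclass and the hypothetical embed() and cosine() helpers from the sketches above.

```python
# Structure-aware, coarse-to-fine retrieval over pre-computed IRs.
# Assumes the IR dataclass and the embed()/cosine() stubs defined earlier.
from pathlib import Path

def retrieve_structured(query: str, irs: list["IR"], k: int = 3) -> list["IR"]:
    """Rank module-level IRs first, then rank only the file-level IRs
    inside the winning modules. The pre-computed hierarchy narrows the
    search instead of scoring one giant flat pool of chunks."""
    q = embed(query)
    modules = sorted((ir for ir in irs if ir.level == "module"),
                     key=lambda ir: cosine(q, embed(ir.text)),
                     reverse=True)[:k]
    chosen_dirs = {m.subject for m in modules}
    files = [ir for ir in irs if ir.level == "file"
             and str(Path(ir.subject).parent) in chosen_dirs]
    files.sort(key=lambda ir: cosine(q, embed(ir.text)), reverse=True)
    # Hand the LLM coarse context (module IRs) plus the best fine detail.
    return modules + files[:k]
```

In a design like this, an open-ended query pulls module-level context first and drills into file-level detail only where it is relevant, which is exactly the behavior flat chunking cannot provide.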
We continue to refine and improve our transpiler. Future advances include:
- Advanced IR Generation: more language-specific specialization, deeper static analysis integration, and more sophisticated signal isolation.
- Expanded Scope: Application to PDFs and other technical documents, mixed model usage for specialized tasks, and fine-tuning for specific documentation needs.
- Advanced Processing: graph dependency tracking and visualization, more multi-pass optimizations, and high-value codebase-wide content generation such as system architecture diagrams.
Key takeaway: Structure matters
Structural methods are essential for comprehension at scale. While LLMs are powerful tools, they need to be supported by systematic approaches that preserve and leverage the inherent structure of code. The combination of classical computer science principles with modern AI capabilities provides a path to more effective tools for understanding and documenting complex software systems.
By treating documentation generation as a compilation problem and leveraging the power of LLMs within a structured framework, we can build a system that handles the scale and complexity of modern codebases while producing high-quality, consistent, and maintainable documentation.
This will be an important step forward in making complex codebases more accessible and understandable, potentially saving organizations thousands of engineering hours while improving code quality and collaboration. As these techniques continue to evolve, we hope to enter a new era where comprehensive, up-to-date documentation becomes the norm rather than the exception.