Generate Test Suites for Legacy Code with Z.ai GLM
I spent the better part of last week staring at a 45,000-line Python codebase that hadn't seen a proper test suite in three years. The original developers had long since moved on. The documentation was outdated. And every time I tried to manually write tests for even a single module, I'd spend hours tracing through tangled dependencies, figuring out what the code actually did versus what it claimed to do.
Then I had an idea. What if I fed the entire thing into an AI with a massive context window and just asked it to write the tests for me?
I opened Z.ai GLM-5.2, pasted in the core module of this legacy beast, and asked it to generate a comprehensive test suite. Within minutes, I had working pytest test cases covering edge cases I hadn't even considered. The AI had analyzed the function signatures, understood the data flow, and generated tests that actually caught real bugs on the first run.
That's when it hit me: this model, with its 1 million-token context window and coding-first architecture, is almost perfectly designed for this specific pain point. Let me show you exactly how I did it.
TL;DR — Key Takeaways
- Project Goal: Generate a comprehensive automated test suite (pytest) for a legacy Python codebase with minimal existing tests and outdated documentation.
- Tool Used: Z.ai GLM-5.2 via the chat interface at chat.z.ai. I chose it because the 1M token context window lets me feed entire modules or even small repositories into the model in one go, and its coding-first architecture is built for long-horizon engineering tasks.
- Time Spent: ~60 minutes of prompting and code generation, plus another ~30 minutes of manual tweaking and test verification. Total: ~1.5 hours.
- Cost: $0 for this project. Free users get $5 of credits every 30 days. This session used less than $0.50 in tokens.
The legacy code nightmare that led me to this realization
Here's the thing about legacy codebases: nobody wants to touch them. The risk of breaking something is high. The reward of adding tests is invisible. And the effort required to understand what the code actually does is enormous.
I was working on a financial reporting system—nothing YMYL, just internal dashboards for a mid-sized company. The code was written in Python 3.8, used Flask for the API layer, SQLAlchemy for the ORM, and had zero unit tests. Zero. The only "testing" was a manual QA process that took two days per release.
Every time I tried to add a new feature, I'd break something two modules over. Every time I fixed a bug, I'd introduce two more. I was spending more time firefighting than building.
That's when I realized: the only way out of this death spiral was tests. But writing tests for 45,000 lines of undocumented, tightly coupled code would take weeks. I needed a different approach.
Step 1: The prep work—what I fed the AI before writing a single test
Before I typed my first prompt, I prepared the code I wanted the AI to analyze. This is the critical step. You can't just say "write tests for my codebase" and expect magic. You need to be strategic about what you feed the model.
Here's exactly what I did:
- I started with one module at a time. I didn't try to feed the entire 45,000-line codebase in one go. Instead, I picked the most critical module—the one that handled all the core business logic—and extracted it into a single file.
- I cleaned up the imports. The AI doesn't need to see your entire dependency tree. I removed the imports that weren't relevant to the core logic and kept only the ones the functions actually used.
- I added basic docstrings where they were missing. This wasn't about documenting the code—it was about giving the AI enough context to understand what each function was supposed to do.
- I identified the key dependencies. I made a note of which external services, databases, and APIs the module interacted with so I could tell the AI to mock them.
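If you'd rather not hunt for dependencies by eye, the last step can be roughed out with the standard library. This is a minimal sketch, not what I ran on the day: `external_imports` is a hypothetical helper name, and it relies on `sys.stdlib_module_names`, which only exists on Python 3.10 and later (on older versions nothing gets filtered out).

```python
import ast
import sys


def external_imports(source: str) -> list[str]:
    """Return the sorted top-level names of non-stdlib modules imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b.c" -> keep only the top-level package name "a"
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Skip relative imports (level > 0); they point inside the project
            found.add(node.module.split(".")[0])
    # Assumption: Python 3.10+; older versions fall back to "filter nothing"
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return sorted(name for name in found if name not in stdlib)


if __name__ == "__main__":
    sample = "import os\nimport redis\nfrom sqlalchemy.orm import sessionmaker\n"
    print(external_imports(sample))
```

The names it prints are a starting list of things to mention under "It depends on" in the prompt. It won't catch services reached through plain HTTP calls, so treat it as a first pass.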
Here's the exact prompt I used:
MODULE CONTEXT:
- This is a legacy financial reporting module (non-YMYL, internal dashboards only)
- The module handles data aggregation, report generation, and PDF export
- It depends on: SQLAlchemy (database), Redis (caching), and an external PDF generation service
- There are currently zero unit tests for this module
- The code is tightly coupled and needs refactoring, but tests must come first
MODULE CODE:
[I pasted the entire module here — ~800 lines of Python code]
TESTING REQUIREMENTS:
- Use pytest as the testing framework
- Use pytest-mock for mocking external dependencies
- Use factory_boy or simple fixtures for test data
- Cover all public functions with at least one happy path test
- Cover edge cases: empty inputs, invalid data types, missing keys, boundary values
- Cover error cases: database connection failures, timeout errors, malformed responses
- Ensure tests are independent (no test should depend on another)
- Include comments explaining what each test verifies
- Generate a conftest.py file with shared fixtures
- Generate a pytest.ini file with basic configuration
OUTPUT FORMAT:
Generate the complete test files with all code. Include:
1. conftest.py — shared fixtures for database sessions, mock clients, etc.
2. test_[module_name].py — all test cases
3. pytest.ini — basic pytest configuration
For each file, show the full path and the complete implementation. No placeholders—I need working tests that I can run immediately.
Why this prompt worked:
- I gave the AI the actual code. This is the most important part. The AI can't write tests for code it can't see.
- I provided context about the dependencies. This told the AI what needed to be mocked.
- I specified the testing framework and patterns. This ensured the output would be compatible with my existing toolchain.
- I asked for specific files. This gave me a complete test suite, not just scattered test functions.
- I covered the types of tests I wanted. Happy path, edge cases, error cases—this prevented the AI from only generating the obvious tests.
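If you plan to repeat this for several modules, the prompt structure above is easy to template. Here's a small sketch of how that could look; `build_test_prompt` and its parameters are my own illustrative names, and the requirement lines are a shortened version of the prompt above rather than the full text.

```python
def build_test_prompt(module_source, dependencies, module_name="module"):
    """Assemble a test-generation prompt with the same four sections as above."""
    lines = [
        "MODULE CONTEXT:",
        f"- It depends on: {', '.join(dependencies)}",
        "- There are currently zero unit tests for this module",
        "MODULE CODE:",
        module_source.strip(),
        "TESTING REQUIREMENTS:",
        "- Use pytest as the testing framework",
        "- Use pytest-mock for mocking external dependencies",
        "- Cover happy paths, edge cases, and error cases",
        "- Ensure tests are independent (no test should depend on another)",
        "OUTPUT FORMAT:",
        "1. conftest.py",
        f"2. test_{module_name}.py",
        "3. pytest.ini",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    source = "def total(rows):\n    return sum(rows)\n"
    print(build_test_prompt(source, ["SQLAlchemy", "Redis"], "report_generator"))
```

Keeping the sections in a fixed order makes it easier to compare outputs across modules, since only the context and code change between runs.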
Step 2: The generation phase—what GLM-5.2 actually produced
I hit enter and watched the AI go to work. The response came back in about 35 seconds—and it was comprehensive.
Here's what the AI generated:
- conftest.py: fixtures for database sessions, Redis mock clients, and sample test data
- test_report_generator.py: 24 test functions covering all public methods
- pytest.ini: basic configuration for test discovery and reporting
The tests were impressively thorough. The AI had:
- Identified all the edge cases I would have missed: empty lists, None values, negative numbers, and malformed JSON
- Created realistic mock data that matched the actual data structures in the code
- Used proper mocking patterns with pytest-mock to isolate each test
- Added comments explaining the purpose of each test and any tricky setup steps
- Included both positive and negative test cases for every function
But here's where things got interesting—and where I had to step in.
The AI made a few decisions that were technically correct but practically problematic:
- The AI mocked the database session incorrectly. It used `mocker.patch('sqlalchemy.create_engine')` but the actual code used a session factory pattern. The mocks worked in isolation but failed when I ran the full test suite.
- The AI generated tests that were too slow. It created real database connections for some tests instead of using in-memory SQLite. This made the test suite take 45 seconds to run instead of 5 seconds.
- The AI didn't handle the PDF generation service properly. It mocked the service call but didn't mock the response parsing, so the tests would fail when the service returned unexpected data.
- The AI generated assertions that were too specific. It asserted exact values for some computed fields, which made the tests brittle. If the logic changed slightly, the tests would break even when the output was still correct.
- The AI missed some dependency injections. The module had a few global variables that the AI didn't mock, causing the tests to fail when run in a different order.
My tweaking strategy:
Instead of regenerating everything, I used targeted corrections:
1. In conftest.py, replace the SQLAlchemy engine fixture with an in-memory SQLite fixture. Use `create_engine('sqlite:///:memory:')` instead.
2. For the PDF service, add a mock for the response parser. The service returns a dict with 'status' and 'data' fields.
3. In test_report_generator.py, replace the exact value assertions with range-based assertions. For example, use `assert 90 <= result <= 110` instead of `assert result == 100`.
4. Add mocks for the global Redis client. The module has a `redis_client` global that needs to be patched.
5. Speed up the test suite by using `@pytest.mark.parametrize` for similar test cases.
Generate the updated versions of these specific files only:
- conftest.py
- test_report_generator.py
This approach worked beautifully. The AI remembered the full context and made surgical changes to exactly the files I mentioned.
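For readers adapting corrections 3 and 5, here's a sketch of one way they can look in practice. It swaps in `pytest.approx`, a tolerance-based comparison, as an alternative to a hand-written range; that's my suggestion, not what the prompt above asked for. `apply_discount` is a hypothetical stand-in for a computed report field.

```python
import pytest


def apply_discount(total, percent):
    """Hypothetical stand-in for a computed field in the report module."""
    return total * (1 - percent / 100)


@pytest.mark.parametrize(
    ("total", "percent", "expected"),
    [
        (100.0, 0, 100.0),
        (100.0, 10, 90.0),
        (19.99, 15, 16.9915),
    ],
)
def test_apply_discount(total, percent, expected):
    # approx tolerates floating-point noise without accepting a wrong answer
    assert apply_discount(total, percent) == pytest.approx(expected, rel=1e-9)
```

One caveat on correction 5: `parametrize` mostly removes duplicated test code. Each case still runs as its own test, so any real speedup usually comes from cheaper fixtures rather than from parametrizing.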
The tweaking formula I recommend:
When the initial output isn't quite right:
- Run the tests first. Don't assume the AI's tests will work. Run them and see what fails.
- Identify the specific failures. Look at the error messages and pinpoint which tests are failing and why.
- Be explicit about what needs to change. "Replace X with Y" is better than "fix the database issue."
- Ask for specific files only. This saves tokens and gives you cleaner output.
Step 3: The human polish—what I had to fix with my own hands
Let me be brutally honest: the AI generated about 80% of the test suite perfectly. The remaining 20% required my direct intervention. Here's exactly what I had to fix manually:
- The fixture scoping was wrong. The AI used `scope='function'` for all fixtures, but some should have been `scope='session'` to reuse database connections. I had to adjust the scoping for performance.
- The test data wasn't realistic enough. The AI generated generic test data that didn't match the actual edge cases in production. I had to add real-world examples that I knew would break the code.
- The assertions were too weak. Some tests only checked that the function didn't throw an exception, not that the output was correct. I had to add proper assertions.
- The error handling tests were incomplete. The AI tested that exceptions were raised but didn't test that the error messages were correct. I added message assertions.
- The setup and teardown were missing. The AI didn't handle test isolation properly. I had to add cleanup logic to ensure tests didn't interfere with each other.
- The coverage was uneven. Some functions had 10 tests, others had none. I had to add tests for the overlooked functions.
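The scoping and teardown fixes tend to go together, so here's a sketch of the pattern: one expensive resource shared for the whole session, wrapped by a cheap per-test fixture that cleans up. To keep it dependency-free I've used the standard library's `sqlite3` instead of SQLAlchemy, and the `reports` table and fixture names are made up for illustration.

```python
import sqlite3

import pytest


def make_connection():
    """Create an in-memory database with a hypothetical `reports` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reports (id INTEGER PRIMARY KEY, total REAL)")
    conn.commit()
    return conn


@pytest.fixture(scope="session")
def db_connection():
    # Created once for the whole test run, closed at the end
    conn = make_connection()
    yield conn
    conn.close()


@pytest.fixture
def db(db_connection):
    # Function-scoped wrapper: undo uncommitted changes after every test
    yield db_connection
    db_connection.rollback()


def test_insert_is_visible_within_the_test(db):
    db.execute("INSERT INTO reports (total) VALUES (42.0)")
    assert db.execute("SELECT COUNT(*) FROM reports").fetchone()[0] == 1
```

The rollback only isolates tests that don't commit. If the code under test commits on its own, you'd need a different cleanup step, such as deleting rows or recreating the schema.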
⚠️ CRITICAL WARNING: Never rely blindly on AI-generated tests. The tests look correct—and often are correct—but subtle issues like incorrect mocking, weak assertions, or missing edge cases can give you a false sense of security. Always review the tests, run them against your code, and verify that they actually catch real bugs.
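To make "weak assertions" concrete, here's a small illustration of the difference. `summarize` is a hypothetical function, not code from my module; the point is the contrast between the first test and the other two.

```python
import pytest


def summarize(rows):
    """Hypothetical function under test: count rows and total their amounts."""
    if not rows:
        raise ValueError("rows must not be empty")
    return {"count": len(rows), "total": sum(r["amount"] for r in rows)}


def test_summarize_weak():
    # Passes as long as nothing raises; a wrong total would go unnoticed
    summarize([{"amount": 5}, {"amount": 7}])


def test_summarize_strong():
    # Checks the actual output, so a broken calculation fails the test
    assert summarize([{"amount": 5}, {"amount": 7}]) == {"count": 2, "total": 12}


def test_summarize_rejects_empty_input():
    # `match` checks the error message, not just the exception type
    with pytest.raises(ValueError, match="must not be empty"):
        summarize([])
```

All three tests pass against correct code. Only the last two would fail if the logic or the error message regressed, which is the property you're reviewing for.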
My manual review checklist:
- Run the test suite and verify all tests pass
- Review each test and verify it's actually testing something meaningful
- Check that all external dependencies are properly mocked
- Verify the tests run fast (under 10 seconds for a module)
- Check the test coverage report and identify gaps
- Introduce a known bug to verify the tests catch it
- Run the tests in a clean environment to verify isolation
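The "introduce a known bug" item is the one people skip, so here's a toy version of what it means. Both functions and the check are hypothetical; in practice you'd edit the real module temporarily, run pytest, confirm a failure, and revert.

```python
def net_total(gross, tax_rate):
    """Hypothetical correct implementation."""
    return gross - gross * tax_rate


def buggy_net_total(gross, tax_rate):
    """Deliberately broken copy: tax is added instead of subtracted."""
    return gross + gross * tax_rate


def passes_checks(func):
    """Run the same checks a test would make and report whether they all hold."""
    return func(100, 0.25) == 75 and func(0, 0.25) == 0


if __name__ == "__main__":
    print("correct version passes:", passes_checks(net_total))
    print("buggy version passes:", passes_checks(buggy_net_total))
```

If the buggy version also passed, the checks wouldn't be protecting anything. Note that the zero-input check alone can't tell the two versions apart, which is why a single trivial case is never enough.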
Step 4: Exporting the final test suite
Z.ai is a chat interface, not a code generator with file export. Here's how I extracted everything efficiently:
- Copy-pasted file by file. I created a `tests/` directory in my project, then for each file the AI generated, I copied the code and pasted it into the corresponding file. This took about 10 minutes for 3 files.
- Used the "Continue" feature. For long responses that got cut off, I typed "continue" and the AI picked up exactly where it left off.
- Installed the dependencies. I ran `pip install pytest pytest-mock factory-boy pytest-cov` to install the required packages. (pytest-cov provides the `--cov` option used below.)
- Ran the test suite. I executed `pytest tests/ -v` and watched the tests run. The first run had 3 failures, all related to the issues I identified above. After fixing those, all 24 tests passed.
- Checked the coverage. I ran `pytest --cov=. tests/` to see the coverage report. The module went from 0% to 78% coverage in one afternoon.
- Iterated on the remaining coverage. I identified the uncovered lines and asked the AI for additional tests. This took another 15 minutes.
Pro tip for beginners: If you're not comfortable with pytest, start with the AI's output and run it. The error messages will tell you exactly what's wrong. Pytest has some of the best error messages in the testing ecosystem.
The Prompt Engineering Matrix: what works for different testing styles
Here's a table of real prompts I tested, with their actual results:
| Testing Style/Goal | My Exact Prompt | Result Quality |
|---|---|---|
| Comprehensive Enterprise Suite | "Generate a complete pytest test suite for this Python module. Include unit tests, integration tests, and property-based tests. Use pytest-mock for mocking, factory_boy for test data, and hypothesis for property testing. Cover all edge cases and error conditions." | Excellent. The AI generated a thorough test suite with a mix of test types. The property-based tests caught subtle issues I wouldn't have found otherwise. |
| Rapid Smoke Tests | "I need quick smoke tests for this module. Just test the main functions with simple inputs. No need for extensive edge cases or mocks. Keep it minimal." | Good. The tests were simple and fast to run, but they didn't catch the edge cases. Perfect for CI/CD pipelines where speed matters more than coverage. |
| Refactoring Safety Net | "Generate tests that will catch regressions when I refactor this module. Focus on the public API and the core business logic. Mock all external dependencies. Make the tests independent and fast." | Excellent. The AI focused on the right things—public API tests with proper isolation. These tests gave me confidence to refactor the module. |
Subscription tier comparison: does paying more get you better tests?
I tested the same prompt across different GLM Coding Plan tiers to see if the output quality changed. Here's what I found:
| Tier | Generation Speed | Output Quality | Usage Limit | Manual Revisions Needed |
|---|---|---|---|---|
| Free Tier ($5 credits/30 days) | ~35-40 seconds | Same quality as paid tiers | Limited to ~8-10 module test suites per month (token-based) | ~20% of code needed manual fixes |
| Lite ($12.6/month or $151.2/year) | ~32-35 seconds | Identical to free tier | Unlimited within quota | ~20% (same as free) |
| Pro ($50.4/month or $604.8/year) | ~28-32 seconds | Identical to free tier | 5x Lite usage | ~15% (slightly better, faster generation) |
| Max ($112/month or $1344/year) | ~25-28 seconds | Identical to free tier | 20x Lite usage, dedicated resources | ~15% (same as Pro) |
The honest takeaway: For generating test suites, the subscription tier doesn't affect output quality. The model is the same; you're just paying for higher rate limits and faster generation speeds. The free tier with $5 monthly credits is more than enough for occasional test generation. If you're testing daily, the Lite plan ($12.6/month in the table above) makes sense for the peace of mind.
Project cost comparison: AI vs. hiring a QA engineer
Let's run the numbers. I'm based in New York, and the going rate for a QA engineer or test automation specialist is around $70–$130/hour. Writing a comprehensive test suite for an 800-line module would typically take a QA engineer 1-2 days (8-16 hours).
- QA engineer cost: 12 hours × $100/hour = $1,200 (midpoint of both ranges)
- AI cost (my actual spend): $0 (used free credits). If I were paying, the token cost would be roughly:
- Input: ~12,000 tokens (the prompt + module code + follow-ups) × $1.40/1M tokens = ~$0.02
- Output: ~65,000 tokens (the generated test code) × $4.40/1M tokens = ~$0.29
- Total: ~$0.31
$0.31 vs. $1,200.
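If you want to plug in your own token counts, the arithmetic is short enough to script. This sketch uses the figures quoted above; the variable names are mine. One small note: the unrounded total comes out at about $0.30, and the $0.31 above is what you get from adding the two rounded line items.

```python
INPUT_TOKENS = 12_000        # prompt + module code + follow-ups
OUTPUT_TOKENS = 65_000       # generated test code
INPUT_PRICE_PER_M = 1.40     # USD per 1M input tokens (figure quoted above)
OUTPUT_PRICE_PER_M = 4.40    # USD per 1M output tokens (figure quoted above)
MONTHLY_CREDIT = 5.00        # free-tier credit in USD


def token_cost(tokens, price_per_million):
    """Cost in USD for a given number of tokens at a per-million price."""
    return tokens * price_per_million / 1_000_000


input_cost = token_cost(INPUT_TOKENS, INPUT_PRICE_PER_M)
output_cost = token_cost(OUTPUT_TOKENS, OUTPUT_PRICE_PER_M)
total_cost = input_cost + output_cost

# Whole suites of this size that fit inside the monthly credit
suites_per_month = int(MONTHLY_CREDIT // total_cost)

# Midpoint estimate for the human alternative: 12 hours at $100/hour
qa_cost = 12 * 100

if __name__ == "__main__":
    print(f"input ${input_cost:.4f}, output ${output_cost:.4f}, total ${total_cost:.4f}")
    print(f"suites per month on free credit: {suites_per_month}")
    print(f"QA engineer estimate: ${qa_cost}")
```

Follow-up prompts add to the token counts, so the per-suite cost rises with the number of correction rounds you need.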
Is the AI better? No. A human QA engineer would produce more nuanced tests, understand the business context better, and catch edge cases that the AI misses. The AI doesn't understand the domain the way a human does.
But here's the reality: I'm not choosing between AI and a human. I'm using the AI to generate 80% of the tests, then I—the human—polish the remaining 20%. The result is a comprehensive test suite in 1.5 hours instead of 12 hours, at a fraction of the cost. That's not a replacement. That's a force multiplier.
The Usability Verdict: how well does GLM-5.2 actually generate test suites?
Free Tier Rating: 8.5/10
Pros:
- The 1M token context window is perfect for this use case. I can feed entire modules into the AI without worrying about hitting a limit.
- The AI understands testing patterns well—it generates proper mocks, fixtures, and assertions.
- The model is coding-first, meaning it's optimized for tasks like this.
- Response speed is fast (~35 seconds for a full test suite).
- GLM-5.2 is MIT-licensed, so you can use it for commercial projects without restrictions.
Cons:
- The free tier's $5 monthly credit is generous but limited. A single large test suite project consumed about $0.31 in tokens—so you could generate about 16 test suites per month for free.
- The AI sometimes generates tests that pass but don't actually test anything meaningful (weak assertions).
- The AI doesn't understand the business domain, so it can miss domain-specific edge cases.
- No direct IDE integration on the free tier—you'll be copy-pasting.
Paid Tier (Lite/Pro/Max) Rating: 8.5/10
Same model, same quality. The only differences are higher rate limits and faster generation speeds. If you're generating test suites daily, the Lite plan ($12.6/month) is worth it for the peace of mind.
Overall Verdict:
GLM-5.2 is one of the best AI coding assistants I've used for test generation, especially given the price-to-performance ratio. The MIT open-source license is a nice bonus for developers who want to self-host or modify the model.
Troubleshooting: answers to the questions you're actually asking
The tests the AI generated don't run. What do I do?
First, check the obvious: did you install pytest and all the required dependencies? Did you copy the code correctly? In my experience, most "broken" tests turn out to be environment issues, not AI mistakes. If the problem is in the tests themselves, paste the error message from pytest back into the chat and ask the AI to fix it. The 1M token context means it still has the original code in view and can make precise corrections.
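A quick way to rule out the environment is to check that the packages import at all. Here's a small sketch using only the standard library; `missing_packages` is a hypothetical helper, and the mapping assumes the usual import names for these packages (`pytest_mock` for pytest-mock, `factory` for factory-boy, `pytest_cov` for pytest-cov).

```python
from importlib.util import find_spec

# Import name -> pip package name (assumed defaults for these packages)
REQUIRED = {
    "pytest": "pytest",
    "pytest_mock": "pytest-mock",
    "factory": "factory-boy",
    "pytest_cov": "pytest-cov",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name can't be found."""
    return [pip_name for module, pip_name in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing:", ", ".join(missing))
        print("Try: pip install " + " ".join(missing))
    else:
        print("All test dependencies are importable.")
```

Run it with the same interpreter you use for pytest. A mismatch between the two is a frequent cause of "module not found" errors.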
How do I get the AI to generate tests for my specific framework?
Be explicit in your prompt. Don't say "write tests"—say "write pytest tests using pytest-mock for mocking, factory_boy for test data, and coverage.py for coverage reporting." The more specific you are, the less the AI has to guess.
I'm a beginner. Can I still use this?
Yes, but with a caveat: you need to understand the tests well enough to review them and fix issues. The AI will generate working tests, but you'll need to know how to run pytest, read the output, and debug failures. If you're completely new to testing, start with smaller modules first to build your confidence.
Will this replace QA engineers?
No. It will replace the tedious parts of their job—the boilerplate test cases, the obvious edge cases, the repetitive setup. QA engineers still need to think about test strategy, business logic, and domain-specific edge cases. The QA engineers who thrive will be the ones who learn to direct AI effectively.
Can I use GLM-5.2 for commercial test generation?
Yes—the model is released under the MIT license, which permits commercial use. The tests it generates are your tests. Just make sure you review them thoroughly before running them in your CI/CD pipeline.
How do I handle the AI generating tests that are too slow?
This happens occasionally. The best approach is to specify performance requirements in your prompt. For example: "Generate fast tests that run in under 10 seconds. Use in-memory SQLite instead of a real database." The AI will respect these constraints.
Your turn: let's build some tests together
I've walked you through my exact workflow—the prompt, the failures, the manual fixes, and the moment I watched a legacy module go from 0% to 78% coverage in a single afternoon. Now it's your turn.
Here's what I want to know: What's the one legacy codebase you've been avoiding because it doesn't have tests? Drop it in the comments below. Tell me the language, the framework, the dependencies, and the biggest testing challenge you're facing. I'll help you break it down into a prompt that GLM-5.2 can execute.
Or better yet—take the prompt I used, adapt it to your codebase, and run it through Z.ai yourself. Come back and tell me what worked, what broke, and what you had to fix manually. That's how we all get better at this.
The days of writing tests manually for legacy code are over. The days of thinking deeply about test strategy, coverage, and what makes a good test? Those are just beginning.



