Why Data Classification is Broken (and How ML Fixes It)
It’s no secret that accurate data classification is the backbone of any data security program. However, as companies ingest more data across a growing mix of SaaS tools, cloud storage, and internal systems, valuable information becomes scattered and nearly impossible to track. This data sprawl makes classification daunting while introducing new risks surrounding non-compliance, security blind spots, and costly breaches.
We've seen how AI has revolutionized verticals such as healthcare, finance, and digital marketing, and that impact has also made its way into data security. With companies increasingly turning to Machine Learning (ML) for classification, the antiquated manual processes of yesterday's data teams are quickly becoming obsolete.
In this guide, we’ll dive into the inefficiencies of manual classification efforts and how ML-powered solutions (like Teleskope) are transforming data protection from a burden to a competitive advantage.
What is Data Classification?
Data classification is the process of identifying and organizing sensitive data (like PII, PCI, PHI, and secrets) across various formats and locations. It's what enables teams to enforce access controls, apply security policies, and monitor risk effectively. Without it, sensitive information remains unaccounted for, unguarded, and invisible.
Imagine an employee downloads a report containing customer names, emails, and partial payment details and then uploads it to a shared cloud folder to collaborate with another team. Without classification, that file may never be flagged as sensitive, potentially leaving customer data exposed and compliance teams in the dark.
While data classification is a decades-old practice, the recent rise in cybersecurity threats and increasingly strict compliance standards (such as GDPR, HIPAA, and SOC 2) have reinforced its importance for companies across all verticals. And with regulations like CCPA, VCDPA, and the Colorado Privacy Act emerging across jurisdictions, the need for precise, automated classification has never been more urgent.
Why Traditional Approaches to Data Classification Fall Short
There are three commonly used (yet outdated) approaches to data classification: manual, rules-based, and keyword-based. Each has its own uses, but all three have one thing in common: they are rigid, limited in scope, and ill-suited for the scale and complexity of modern data environments.
Manual Classification
Manual classification is exactly what it sounds like: teams pull data into spreadsheets or legacy tools and rely on human judgment to label information, often referencing static documentation or internal knowledge. It’s a slow, resource-heavy process that doesn’t scale, and even minor mistakes can lead to significant security risks.
This approach is common in early-stage or smaller environments, where teams can manage data manually with limited resources and lower data volumes.
But that approach doesn’t hold up in today’s tech environment, where data flows through countless tools, formats, and integrations. With so many moving parts, manually classifying every document, ticket, or record becomes a full-time job — and one that’s prone to mistakes. And with 68% of data breaches traced back to human error, even small slip-ups can lead to serious security consequences.
“Manual classification works for small teams, but as they scale, it becomes impossible. They simply can’t keep up with the volume, velocity, and variety of modern data.” — Ivan Aguilar, Director of Data Science at Teleskope
Rules-Based Classification
Rules-based classification uses predefined logic — like regular expressions, if/then conditions, or pattern-matching rules — to identify and label sensitive data. It works well in clean, structured environments where the data is predictable. Think of a database where Social Security numbers always follow the same format in the same column.
But the moment your data gets messy, rules start to fall apart. Unstructured inputs, shifting formats, or even just a typo can throw things off. For instance, if a rule is looking for a Social Security number in the format 123-45-6789, and a user types 123456789 or swaps a dash for a space, the system might miss it entirely. These systems depend on consistency, something real-world data rarely offers.
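To make that brittleness concrete, here's a minimal sketch of a rules-based check: a single regex keyed to the dashed SSN format. The pattern and sample strings are illustrative, not taken from any particular product.

```python
import re

# A typical rules-based detector: a fixed pattern for SSNs in the form 123-45-6789.
SSN_RULE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

samples = [
    "SSN: 123-45-6789",   # matches the expected format
    "SSN: 123 45 6789",   # space instead of dash: missed
    "SSN: 123456789",     # no separators: missed
]

for text in samples:
    flagged = bool(SSN_RULE.search(text))
    print(f"{text!r} -> flagged={flagged}")
```

The rule does exactly what it was told, and nothing more: any deviation from the expected format slips straight past it.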
At scale, maintaining a growing set of rules becomes a time-consuming headache. And unlike ML-based solutions, these systems don’t learn or adapt; they simply do what they’re told, which often isn’t enough when data is constantly changing.
“Rules-based systems work well with structured data, but real-world data is anything but clean.” — Ivan Aguilar, Director of Data Science at Teleskope
Keyword-Based Classification
Keyword-based classification looks for specific words or phrases — like “SSN,” “confidential,” or “password” — to flag potentially sensitive data. It gained popularity in the early 2000s as a quick, lightweight solution and became a staple in early Data Loss Prevention (DLP) tools offered by Symantec and Websense (now Forcepoint). No complex setup; just scan for certain terms and take action.
The problem? Keyword-based classification doesn’t understand crucial context. A keyword engine might catch “Washington,” but it can’t tell if you’re referring to a person’s last name, the state, or a sports team. That lack of nuance leads to high false positives, alert fatigue, and worse, missed risks. When your system treats every keyword match the same, critical threats can slip through while your team drowns in noise.
“You can’t rely on simple keyword matches for something as dynamic as sensitive data. Context matters. What looks sensitive in one situation might be totally irrelevant in another, and static keyword lists just can’t tell the difference.” — Ivan Aguilar, Director of Data Science at Teleskope
None of these classification methods were designed with today's data security landscape in mind. Data Security Posture Management (DSPM) and DLP tools rely on accurate, real-time insights to do their job. When the classification feeding them lacks contextual awareness and adaptability, those tools can't deliver the precision required to protect sensitive data across dynamic, high-volume environments.
Plus, maintaining legacy classification systems at scale often becomes more expensive over time, requiring constant rule updates, manual oversight, and additional tools to fill in the gaps. That’s why more companies are turning to machine learning, not just to keep up but to get ahead.
Why ML-Based Data Classification is the Most Scalable Solution
Unlike the aforementioned classification methods, ML-based solutions learn from data patterns, adapt to new formats, and scale automatically. This makes them ideal for companies dealing with complex, ever-changing data environments.
Each ML-based classification method has its own advantages and limitations, and knowing when (and how) to use them is critical to building a system that performs reliably for your teams. Here’s a quick look at the four most common ML-based classification methods, including how they work, when to use them, and what to watch out for.
Large Language Models (LLMs)
Large Language Models (LLMs) are trained on massive text datasets to understand and generate human-like language. Tools like OpenAI’s GPT or DeepSeek’s R1 fall into this category, and they excel at extracting meaning from unstructured data formats like emails, chat logs, and support tickets.
Remember how keyword or rules-based methods can’t tell whether “Washington” refers to a person, place, or sports team? LLMs understand this context and classify different terms accordingly. This makes them especially effective in messy, unpredictable data environments where legacy classification methods fall short.
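As a rough illustration, an LLM-based classifier usually comes down to a carefully worded prompt plus whatever completion client you already use. The `complete` callable below is a placeholder, not a specific vendor API, and the labels are our own example taxonomy:

```python
def classify_with_llm(text: str, complete) -> str:
    """Ask a general-purpose LLM to classify a snippet using surrounding context.

    `complete` is assumed to take a prompt string and return the model's reply;
    swap in your own chat/completion client.
    """
    prompt = (
        "Classify the sensitive data in the text below. "
        "Return exactly one label: PII, PCI, PHI, SECRET, or NONE.\n\n"
        f"Text: {text}"
    )
    return complete(prompt).strip()

# The same token, disambiguated by context:
# classify_with_llm("Ship the order to Washington, DC", complete)            -> likely NONE
# classify_with_llm("Patient: Denise Washington, DOB 04/12/1981", complete)  -> likely PHI
```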
The trade-off? While LLMs offer exceptional accuracy and deep contextual understanding, they’re expensive to run, hard to fine-tune, and the exact reasoning behind specific outputs often remains opaque.
Still, LLMs are gaining traction in areas where context is critical. Support platforms use them to identify sensitive information in open-ended messages, and in healthcare, they help detect PHI buried in freeform patient data. The potential is massive — but getting it right requires thoughtful implementation and significant financial investment.
Small Language Models (SLMs)
Small Language Models (SLMs) are trained on focused datasets to handle specific language tasks. They share the same foundation as LLMs but operate on a smaller scale. For instance, SLMs are built on transformer-based architectures and use similar natural language processing techniques as LLMs, just with fewer parameters and a narrower training scope.
This scaled-down approach makes SLMs effective for specialized tasks where they can be tuned to detect specific patterns (e.g., PII in healthcare records or financial documents) without the overhead of a general-purpose LLM.
In the realm of ML-based classification methods, SLMs strike a valuable balance between context awareness and efficiency. Like LLMs, they understand natural language structure and meaning, but they’re more resource-efficient and easier to customize for specific domains such as healthcare, finance, or HR. Thanks to their smaller parameter space and task-specific training, SLMs offer greater transparency and explainability than LLMs — making it easier for teams to understand, audit, and trust their outputs.
While they may not match the raw power of LLMs in highly complex cases, SLMs are ideal for task-specific tuning — especially in use cases where speed, scalability, and cost-efficiency matter most.
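For a rough sense of what an SLM-style classifier looks like in practice, here's a sketch using a small, publicly available token-classification model via the Hugging Face `transformers` pipeline. The model name and example text are ours for illustration; a production system would fine-tune a similarly sized model on its own PII/PHI labels.

```python
from transformers import pipeline

# A compact, task-specific transformer fine-tuned for named-entity recognition.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)

text = "Send the refund to Maria Lopez at 45 Pine St, Seattle before Friday."
for entity in ner(text):
    # Each detection carries a label, the matched span, and a confidence score.
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```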
Recurrent Neural Networks (LSTM / BiLSTM)
Recurrent Neural Networks (RNNs) — and especially their variants like Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) — were among the first neural architectures to achieve strong performance on language tasks like entity recognition and text classification.
Unlike simpler models, LSTMs can capture context over longer sequences, making them useful when sensitive data appears in varied or complex sentence structures. BiLSTMs go a step further by reading input both forward and backward, improving accuracy in identifying boundaries and meaning.
While transformer-based models like LLMs and SLMs have largely surpassed RNNs in accuracy and flexibility, LSTMs still offer value in domain-specific or resource-constrained environments, especially where legacy systems are in place or model transparency is a requirement.
Conditional Random Fields (CRFs)
Conditional Random Fields (CRFs) are graphical models historically used for structured prediction tasks like named entity recognition (NER). They work by modeling the relationships between labels in a sequence, making them ideal for tasks where the meaning of one token depends on its neighbors (e.g., classifying “New York” as a location rather than two separate entities).
Before deep learning became dominant, CRFs were widely considered state-of-the-art for text classification problems involving structured, sequential data. While they’re less common today, CRFs remain relevant in certain enterprise systems due to their interpretability and predictable performance on well-formatted inputs.
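A small sketch with the `sklearn-crfsuite` library shows the flavor: CRFs work from hand-crafted token features and learn label transitions jointly, which is how "New York" comes out as one location span. The features and toy training data below are illustrative.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, i):
    # Hand-crafted features: CRFs rely on these rather than learned embeddings.
    return {
        "word.lower": tokens[i].lower(),
        "word.isdigit": tokens[i].isdigit(),
        "word.istitle": tokens[i].istitle(),
        "prev.istitle": tokens[i - 1].istitle() if i > 0 else False,
    }

# Tiny illustrative training set: "New York" labeled as a single LOC span.
sentences = [["She", "moved", "to", "New", "York", "in", "May"]]
labels = [["O", "O", "O", "B-LOC", "I-LOC", "O", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))  # label transitions are modeled jointly across the sequence
```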
Why Teleskope Chose SLMs Over Other Approaches
At Teleskope, we experimented with a range of techniques — from large language models to deep ensemble architectures — to strike the right balance of accuracy, performance, and cost. In real-world enterprise environments, the trade-offs became clear.
SLMs consistently delivered the best combination of contextual understanding, low latency, scalability, and resource efficiency. Unlike heavier systems, they can be fine-tuned quickly, deployed widely, and audited easily — without sacrificing the accuracy required for sensitive data classification.
That’s why Teleskope uses SLMs as the foundation of our classification engine. We combine them with validation flows and rule-based logic to classify over 150 types of sensitive data with 99.3% accuracy at enterprise scale and speed. It’s a powerful, production-proven alternative to black-box LLMs and brittle rule systems — purpose-built for the complexity of today’s data ecosystems.
Enhancing ML Classification with Pipeline Optimization Techniques
While classification methods define how a model processes language, pipeline strategies help improve accuracy, consistency, and explainability across different types of inputs. Two of the most impactful techniques in this area are:
Deep Ensemble Methods
Deep ensemble methods combine the outputs of multiple ML models to improve the accuracy of data classification. Instead of relying on a single classifier, they run data through several specialized models and aggregate the results. The logic is simple: if multiple models agree, the prediction is more reliable.
This is especially valuable when dealing with highly sensitive data, where companies can’t afford the occasional mistakes other classification methods are known for. Ensembles reduce that risk by covering more ground.
For example, when classifying sensitive corporate documents, one model in the ensemble could specialize in detecting contract clauses in structured financial tables, while another analyzes email threads for confidential project discussions. By combining their strengths, the system accurately flags both spreadsheet-based account numbers and informally mentioned trade secrets, covering gaps that a single model might miss.
The benefit is clear: better precision, fewer false positives, and reduced blind spots. The trade-off? More computational and operational complexity. Running several models in parallel requires additional infrastructure and coordination, but for companies handling large volumes of sensitive data, the added assurance is often worth it.
At Teleskope, deep ensemble methods are used to classify structured, semi-structured, and unstructured inputs with greater confidence, especially in environments where errors can trigger regulatory consequences.
Adaptive Thresholding
Adaptive thresholding is an ML technique that dynamically adjusts how confident a system must be before labeling data as sensitive. Instead of applying a fixed threshold across the board, the system tunes its sensitivity based on context, such as the data type, source, or past accuracy.
In data classification, this helps reduce false positives and false negatives by making the system more flexible where needed. For instance, a platform like Teleskope might classify customer support tickets coming through Zendesk. If a phone number appears in a clearly labeled field, the system uses a high threshold to avoid over-flagging. But if the number shows up in the middle of a free-text message, the system can lower the threshold for that input type, increasing its chances of catching subtle risks without overwhelming the team with noise.
This flexibility matters. What’s risky in one scenario may be harmless in another. Adaptive thresholding helps models account for those differences, catching more edge cases without overwhelming security teams with false positives.
The result is balance: fewer false positives and a cleaner signal-to-noise ratio. The trade-off? It can often be difficult to trace or justify why the system adjusted its threshold for a specific classification decision.
Unlock Your Data Advantage with Teleskope
While many vendors offer ML-based classification, Teleskope combines the best of DSPM and DLP like no one else. Rather than layering AI on top of a legacy workflow, Teleskope reimagines classification entirely — bringing visibility and automated action together in one unified platform. The result is accurate, real-time detection of sensitive data across SaaS, cloud, and on-prem systems, paired with automated remediation that enforces your policies the moment risk appears.
Teleskope continuously scans structured and unstructured data across SaaS tools, cloud platforms, and on-prem environments. Whether it’s a support ticket, a database table, or a customer form, the platform classifies sensitive information in real time using a hybrid ML pipeline that blends small language models, rules-based logic, validation flows, and adaptive thresholding.
The result? Teleskope accurately classifies over 150 types of sensitive data across conversations, emails, documents, and other structured and unstructured sources — with throughput fast enough to process up to 40,000 bytes per second on our recommended infrastructure. No bottlenecks, no missed threats, and no LLM-scale compute costs.
But classification is just the beginning. Teleskope also empowers your teams with:
- Automated remediation and policy enforcement, driven by your custom rules. Once sensitive data is identified, Teleskope can automatically redact, encrypt, mask, or restrict access without manual intervention. You can view, monitor, and approve enforcement actions in real time.
- Third-party risk mitigation, with real-time detection and enforcement across external SaaS tools and integrations.
- Audit-ready compliance automation, with logs of every classification and action taken to support regulatory reviews and internal governance.
- Out-of-the-box alignment with privacy laws like GDPR, CCPA, and HIPAA — including support for data subject rights, continuous monitoring, and customizable privacy workflows.
Leading companies like Ramp, GoFundMe, and The Atlantic rely on Teleskope to classify and secure sensitive data across their entire stack. Whether it’s customer records in internal systems or regulated data in third-party tools, Teleskope brings classification, policy enforcement, and remediation into one seamless workflow. With built-in integrations for platforms like Zendesk, teams can automate redaction and access controls right where data lives — without slowing down operations or adding complexity.
If you’re still struggling with manual, inaccurate, or unscalable classification or relying on multiple tools to do what one platform can handle, it’s time to rethink your data security stack. Book a call today to see how Teleskope can replace your patchwork of DSPM, DLP, and privacy tools with a unified solution built for the scale and complexity of modern data.
Introduction
Kyte unlocks the freedom to go places by delivering cars for any trip longer than a rideshare. As part of its goal to reinvent the car rental experience, Kyte collects sensitive customer data, including driver's licenses, delivery and return locations, and payment information. As Kyte continues to expand its customer base and implement new technologies to streamline operations, the challenge of ensuring data security becomes more intricate. Data is distributed across both internal cloud hosting and third-party systems, making compliance with privacy regulations and data security paramount. Kyte initially attempted to address data labeling and customer data deletion manually, but this quickly became an untenable solution that could not scale with the business. Building such solutions in-house didn't make sense either: they would require constant updates to accommodate growing data volumes, distracting engineers from their primary focus of transforming the rental car experience.
Continuous Data Discovery and Classification
In order to protect sensitive information, you first need to understand it, so one of Kyte's primary objectives was to continuously discover and classify their data at scale. To meet this need, Teleskope deployed a single-tenant environment for Kyte and integrated their third-party SaaS providers and multiple AWS accounts. Teleskope discovered and crawled Kyte's entire data footprint, encompassing hundreds of terabytes across a variety of data stores in their AWS accounts. It instantly classified that footprint, identifying over 100 distinct data entity types across hundreds of thousands of columns and objects. Beyond classifying data entity types, Teleskope also surfaced the data subjects associated with the entities, enabling Kyte to categorize customer, employee, surfer, and business metadata separately. This automated approach keeps Kyte's data map up to date, detailing the personal and sensitive data throughout their environment and helping them maintain a structured, secure environment.
Securing Data Storage and Infrastructure
Another critical aspect of Kyte's Teleskope deployment was ensuring the secure storage of data and maintaining proper infrastructure configuration, especially as engineers spun up new instances or made modifications to the underlying infrastructure. While crawling Kyte's cloud environment, Teleskope conducted continuous analysis of their infrastructure configurations to ensure their data was secure and aligned with various privacy regulations and security frameworks, including CCPA and SOC 2. Teleskope helped Kyte identify and fortify unencrypted data stores, correct overly permissive access, and clean up stale data stores that had gone untouched. With Teleskope deployed, Kyte's team will be alerted in real time if one of these issues surfaces again.
End-to-End Automation of Data Subject Rights Requests
Kyte was also focused on streamlining data subject rights (DSR) requests. Whereas their team previously handled these requests manually with ad-hoc workflows and forms, Kyte now uses Teleskope to automate data deletion and access requests across various data sources, including internal data stores like RDS and their numerous third-party vendors such as Stripe, Rockerbox, Braze, and more. When a new DSR request is received, Teleskope seamlessly maps and identifies the user's data across internal tables containing personal information, and triggers the necessary access or deletion query for each specific data store. Teleskope also ensures compliance by automatically enforcing the request with third-party vendors, either via API integration or, where a vendor doesn't expose an API endpoint, via email.
Conclusion
With Teleskope, Kyte has been able to effectively mitigate risks and ensure compliance with evolving regulations as their data footprint expands. Teleskope reduced security- and compliance-related operational overhead by 80% by automating manual processes and replacing outdated, ad-hoc scripts. Teleskope allows Kyte's engineering team to focus on unlocking the freedom to go places through a tech-enabled car rental experience, and helps them build systems and software with a privacy-first mindset. These tangible outcomes allow Kyte to streamline their operations, enhance data security, and focus on building a great, secure product for their customers.

