What Is Data Discovery and How Can It Benefit You

Find out how you can leverage data discovery to build trust and create better data products.

Craig Dennis

February 21, 2023

9 minutes

  • Source system changes: where an engineer could make changes at the source, such as data type or data format.
  • Data collection failure: where a source application is down or bugs have been introduced into the code.
  • Data ingestion failure: where a data pipeline has failed, so data hasn’t been transferred and becomes outdated.
  • Human errors: where someone has manually entered data incorrectly.

    Implementing tests alerts you to these situations so that your reports stay accurate. Testing is also an important element of data observability.
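
As a rough illustration of what such tests check, here is a minimal Python sketch covering two of the situations above: a source system change (a column's data type drifting) and an ingestion failure (data going stale). The `orders` feed, its column names, the `EXPECTED_TYPES` mapping, and the 24-hour freshness window are all assumptions made up for this example, not part of any particular tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expected schema for an illustrative "orders" feed.
EXPECTED_TYPES = {"order_id": int, "amount": float, "created_at": datetime}


def find_type_drift(rows, expected_types):
    """Return {column: {unexpected type names}} for non-null values."""
    drift = {}
    for row in rows:
        for column, expected in expected_types.items():
            value = row.get(column)
            if value is not None and not isinstance(value, expected):
                drift.setdefault(column, set()).add(type(value).__name__)
    return drift


def is_stale(latest_loaded_at, now, max_age=timedelta(hours=24)):
    """True when the newest record is older than the allowed age."""
    return now - latest_loaded_at > max_age


now = datetime(2023, 2, 21, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "amount": 19.99, "created_at": now - timedelta(hours=30)},
    # The source started sending amounts as strings: a source system change.
    {"order_id": 2, "amount": "24.50", "created_at": now - timedelta(hours=26)},
]

drift = find_type_drift(rows, EXPECTED_TYPES)
latest = max(row["created_at"] for row in rows)
stale = is_stale(latest, now)
print(drift)
print(stale)
```

Here the second row is flagged because `amount` arrives as a string, and the feed counts as stale because its newest record is more than 24 hours old.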

    You could build the tests within your own code, but they take extra time to create and are often less effective. This kind of alerting also tends to get switched off, because it produces many notifications for errors that aren’t that important. But what’s the point of creating tests if you’re just turning them off?

    A more common and better solution is dbt. dbt allows you to build tests for the reports that you produce, alerting you to any changes that may affect your results. You can tailor the tests in dbt to your own requirements.

    Take an example of sales data being entered into Salesforce. You know that on any given day the sales team may make up to ten errors among the thousands of records entered, and you’re happy with an error count anywhere up to ten. Within dbt you can set different tolerances, so if the count ever goes over the threshold, you are alerted to investigate.
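
The tolerance idea can be sketched in a few lines of plain Python. This is not dbt's API (dbt tests are declared in YAML and SQL); it only illustrates the threshold logic. The two-tier warn/error split, the function names, the `amount` field, and the rule for what counts as an invalid record are assumptions chosen for the example.

```python
def count_invalid(records, is_invalid):
    """Count the records that fail a validity rule."""
    return sum(1 for record in records if is_invalid(record))


def classify_error_count(error_count, warn_above=5, error_above=10):
    """Map a daily error count to a status using two tolerances."""
    if error_count > error_above:
        return "error"   # over the agreed tolerance: investigate
    if error_count > warn_above:
        return "warn"    # within tolerance, but worth a look
    return "pass"


def missing_amount(record):
    # Hypothetical rule: an opportunity without a positive amount is an error.
    return record["amount"] is None or record["amount"] <= 0


# Illustrative data: 1,000 records, 12 of them entered without an amount.
records = [{"opportunity_id": i, "amount": 100.0} for i in range(1000)]
for record in records[:12]:
    record["amount"] = None

errors = count_invalid(records, missing_amount)
status = classify_error_count(errors)
print(errors, status)
```

With twelve bad records against a tolerance of ten, the status comes back as `"error"`; at ten or fewer it would be `"warn"` or `"pass"`, so nobody is paged for noise the team has already agreed to accept.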

    Ongoing Maintenance

    There could be occasions when new datasets have been introduced into the company that could benefit the reports you’ve already created, or maybe a business process has changed how data is captured, thus skewing the results.

    Whatever it is, sometimes it pays to get back into the datasets and ensure everything is correct. This could be reviewing the graphs you’ve created or having a 15-minute chat with the executive you’ve made the reports for to determine if they have noticed any differences. With data analysis, you want to be confident of your results, regardless of whether the report is a day old or a year old.

    What are the Benefits of Data Discovery?

    There are many clear benefits to carrying out data discovery, but ultimately it improves your understanding of the data, the quality of the data, and your confidence in the results.

    Better Data Understanding

    Whether it’s fresh datasets you’re working on or ones you’re already familiar with, it takes time to truly understand your data. Data discovery helps you to know how the data is collected, stored, and structured at the source so you can be aware of any quirky human or system behavior that could produce inconsistent results.

    After all, every database has its own unique schema structure that you need to be aware of. The more you understand the data you’re working on, the more accurate your assumptions will be.

    Better Data Quality

    Understanding the data you’re using helps to create better quality data. If you discover duplicates or missing data in your data discovery process, you can take action to exclude them or attempt to fill in any missing data.

    Conducting data discovery reveals areas that look fine on the surface but may not be correct, so you can create a plan of action to remedy them and produce as accurate an analysis as possible.
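
A minimal sketch of checking for duplicates and missing data might look like the following. The customer rows, the field names, and the choice to treat `None` and empty strings as missing are assumptions for illustration only.

```python
from collections import Counter


def find_duplicate_keys(rows, key):
    """Return {key value: occurrences} for values that appear more than once."""
    counts = Counter(row.get(key) for row in rows if row.get(key) is not None)
    return {value: n for value, n in counts.items() if n > 1}


def count_missing(rows, fields):
    """Count rows where each field is None or an empty string."""
    return {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in fields
    }


def drop_duplicates(rows, key):
    """Keep the first row seen for each key value."""
    seen = set()
    kept = []
    for row in rows:
        value = row.get(key)
        if value in seen:
            continue
        seen.add(value)
        kept.append(row)
    return kept


# Illustrative rows only.
rows = [
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "C2", "email": ""},
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "C3", "email": None},
]

duplicates = find_duplicate_keys(rows, "customer_id")
missing = count_missing(rows, ["customer_id", "email"])
deduped = drop_duplicates(rows, "customer_id")
print(duplicates, missing, len(deduped))
```

Whether to exclude duplicates or fill in missing values depends on the dataset; the point of the sketch is only that both problems are cheap to surface before any analysis is built on top.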

    Builds Confidence

    The work you do as a data analyst is complex. You need to ensure your data product is as accurate as possible, as it will be used to make informed decisions. Data discovery helps you to understand what could go wrong and enables you to identify discrepancies in your data so you can confidently trust the data you’re using.

    Data Discovery Examples

    Hearing some data discovery examples from real-life data practitioners can be helpful.

    I spoke to Meredith Alder, who’s been a data practitioner since 2003, about where data discovery helped her uncover data that was being modified outside of a business process.

    Meredith was working on a subscription model to identify the likelihood of customer renewals. In part of her research, she took information from the invoicing system to build a formula that determined how likely a specific user was to renew.

    Her results could have been totally wrong if she hadn’t carried out any data discovery. In understanding the business process, she discovered that if a customer called to cancel, people would go around the system to prolong the renewal by reworking the invoice. If Meredith had never found this out, she would have presented incorrect insights that influenced the decisions taken by the actual business users.

    Erik Edelmann, who’s been a data practitioner since 2010, shared insights about his process for encountering new datasets and understanding them. Erik runs queries on the dataset to look for uniqueness and cardinality. This way, he can discover what makes each record unique and its identifying attributes. He analyzes the distributions of complex data sets to identify outliers and fields of interest.
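
The kind of profiling Erik describes could be sketched as follows. This is not his actual process or tooling: the invoice rows, the column names, and the use of the common 1.5 × IQR rule for outliers are assumptions made for the example.

```python
from statistics import quantiles


def profile_columns(rows):
    """Per column: distinct count (cardinality), null count, uniqueness."""
    total = len(rows)
    columns = sorted({column for row in rows for column in row})
    profile = {}
    for column in columns:
        non_null = [row[column] for row in rows if row.get(column) is not None]
        distinct = len(set(non_null))
        profile[column] = {
            "distinct": distinct,
            "nulls": total - len(non_null),
            # A column whose distinct count equals the row count can identify a record.
            "is_unique": total > 0 and distinct == total,
        }
    return profile


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [value for value in values if value < low or value > high]


# Illustrative rows only.
amounts = [10, 12, 11, 13, 12, 11, 500]
rows = [
    {"invoice_id": i + 1, "plan": "basic" if i % 2 == 0 else "pro", "amount": amount}
    for i, amount in enumerate(amounts)
]

profile = profile_columns(rows)
outliers = iqr_outliers([row["amount"] for row in rows])
print(profile)
print(outliers)
```

In this toy data, `invoice_id` is the only column that is unique per record, `plan` has a cardinality of two, and the amount of 500 stands out as the value to ask someone about.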

    Final Thoughts

    When you’re under pressure to deliver the reports the executive team wanted yesterday, it’s tempting to take shortcuts, and data discovery is often what gets cut. Hopefully, you now understand that if you want to be confident in the data you’re working with and trust the data product you’ve produced, data discovery is an important component of data analysis.
