(How to use knowledge graphs to mine explicit and latent relationships between data and non-data elements in analytics platforms, and enable deeper search and smarter recommendations.)
In organizations with large analytics implementations (think 5,000+ reports and documents across 50+ projects), the biggest challenges are user adoption and the proliferation of similar content.
To improve the reach of analytics content and insights, users need to be able to find the content they are interested in quickly and easily. A smarter system will also recommend adjacent, relevant content.
Traditionally, BI teams have attacked this problem in three ways:
Training: teach the users where the content is.
Library: set up the content in business-friendly subject areas (folders).
Basic out-of-the-box (OOTB) search.
The first two approaches fail fast as the number of users or the volume of content grows. The OOTB search relies on tags such as report names and descriptions, so its results are mediocre at best.
Last year we launched Vozio, a voice-based search plugin that leveraged NLP and a pre-indexed metadata store containing not just metadata relationships but also data relationships. You can ask Vozio questions such as “What was the revenue in store #22 in Q1 2015?”.
Vozio’s metadata store lacked two things:
The capture and codification of relationships associated with non-data elements, such as a header on a dashboard.
The weightage associated with factors such as the placement of a non-data element on a dashboard (larger fonts and higher placements carry more weight).
An emerging need for both end users and developers was to trace content all the way back to its source. This is especially useful for large implementations that span multiple projects and development teams.
So we set about researching what kind of metadata store could enable a deeper, more meaningful search experience -- with smarter recommendations and a trace function. We landed on building a knowledge graph.
The Knowledge Graph
To build a knowledge graph, we needed a place to store all the information we would extract (nodes) and all the relationships we could build around that information (edges).
We selected the Neo4J Graph Database for this. Once the graph DB was installed and set up, the next step was to populate it. The data we needed to build the knowledge graph was:
Dashboards & reports (equivalent to web pages).
Visible and hidden content used on these dashboards and reports.
Titles and descriptions (note that these are very different from the developer-assigned document name; they are how the business actually identifies the document).
Metrics/KPIs along with their business-friendly and derived names.
Attributes/dimensions along with their business-friendly names.
All text content along with its significance on the document. Significance can be determined by size, positioning (where it appears on the report), and formatting (bold, underlined, etc.).
Selector attributes (to determine how the information is grouped).
Prompts (to determine how the information is filtered).
Dashboard navigations (which dashboards are linked from the dashboard under consideration).
Automatic categorization of dashboards: based on the extracted content, we planned to automatically group (cluster) related documents together.
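As an illustration, the node and edge types above can be modeled before loading into the graph. The sketch below is a minimal, hedged example in plain Python: the labels, relationship names, and the `to_cypher` helper are our own placeholders for exposition, not Neo4J's schema or any specific driver API.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Node:
    node_id: str
    kind: str                                # "dashboard", "metric", "attribute", "text", ...
    props: Tuple[Tuple[str, str], ...] = ()  # extracted properties, e.g. business title

@dataclass(frozen=True)
class Edge:
    src: str            # node_id of the source node
    dst: str            # node_id of the target node
    rel: str            # "USES_METRIC", "LINKS_TO", "MAPPED_TO", ... (invented names)
    weight: float = 1.0 # weightage (driven by font size, placement, formatting)

def to_cypher(edge: Edge) -> str:
    """Render one relationship as a Cypher MERGE statement (illustrative only)."""
    return (f'MATCH (a {{id: "{edge.src}"}}), (b {{id: "{edge.dst}"}}) '
            f'MERGE (a)-[:{edge.rel} {{weight: {edge.weight}}}]->(b)')

# A tiny fragment of the kind of graph described above (ids and names invented).
nodes = [
    Node("d1", "dashboard", (("title", "Revenue Overview"),)),
    Node("m1", "metric", (("business_name", "Net Revenue"),)),
    Node("c1", "db_column", (("table", "FACT_SALES"),)),
]
edges = [
    Edge("d1", "m1", "USES_METRIC"),
    Edge("m1", "c1", "MAPPED_TO"),
]
```

In practice each statement would be sent to the graph DB through a driver session; the point here is only the shape of the data.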
Mining the Metadata
The first step was to capture, rank, build relationships among, and index the metadata listed above. There is no easy way to grab this data from OOTB BI platforms, so we developed a mining tool (built using the platform SDK).
The strength of a relationship was determined by factors such as the number of shared reporting objects (attributes, metrics, etc.). Weightage was determined by factors such as the font size, positioning, and formatting of content.
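The two scores can be sketched as simple functions. This is a hedged illustration, not the actual scoring used in the mining tool: the post only says strength is driven by shared reporting objects and weightage by font size, position, and formatting, so the Jaccard overlap and the specific weight formula below are our own assumptions.

```python
def relationship_strength(objs_a: set, objs_b: set) -> float:
    """Strength of the edge between two documents: Jaccard overlap of
    their shared reporting objects (attributes, metrics, ...).
    The Jaccard choice is an assumption for illustration."""
    if not objs_a or not objs_b:
        return 0.0
    return len(objs_a & objs_b) / len(objs_a | objs_b)

def text_weight(font_size: int, y_position: float, bold: bool) -> float:
    """Weightage of a non-data element. Illustrative scoring only:
    larger fonts, higher placement (smaller y), and bold all add weight."""
    w = font_size / 12.0               # normalize against a 12pt body font
    w += 1.0 - min(y_position, 1.0)    # y_position in [0, 1]; 0.0 = top of page
    if bold:
        w += 0.5
    return w
```

With this scoring, a bold 24pt header at the top of a dashboard outweighs a plain footnote, which matches the intuition described above.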
Once all the documents were mined and the extracted data loaded into Neo4J, we could see a very rich network that automatically revealed clusters (groups) of similar documents by content type. The clusters provide more relevant search results and can be leveraged for a smarter recommendation engine.
Through the mining tool, we also captured the following information for metrics and attributes used in the reports/datasets for the documents:
Business-friendly metric names (this is not as easy as one might think, but we believe we cracked the code).
The DB table to which the facts are mapped.
The DB column to which the facts are mapped.
Leveraging the knowledge graph, we were able to provide our users with a complete trace path from source to end document. It also helped us find and fix instances where metrics were mapped to the wrong tables or sources (a huge compliance red flag!) -- a win-win for both technology and business alike.
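The trace function amounts to a path search over the lineage edges. The sketch below uses a breadth-first search over `(source, relationship, target)` triples; the node and relationship names are hypothetical examples of the dashboard-to-column chain described above, not identifiers from the actual system.

```python
from collections import deque

def trace_path(edges, start, goal):
    """BFS over (src, rel, dst) triples; returns the list of hops from
    `start` (e.g. a dashboard) down to `goal` (e.g. a DB column), or
    None if no path exists."""
    adj = {}
    for src, rel, dst in edges:
        adj.setdefault(src, []).append((rel, dst))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None

# Hypothetical lineage: dashboard -> report -> metric -> fact column.
edges = [
    ("Revenue Dashboard", "CONTAINS", "Revenue Report"),
    ("Revenue Report", "USES_METRIC", "Net Revenue"),
    ("Net Revenue", "MAPPED_TO", "FACT_SALES.NET_REV"),
]
```

Surfacing each hop (rather than just the endpoint) is what lets a reviewer spot a metric mapped to the wrong table mid-chain.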
We are currently working on tightly integrating a more advanced NLP engine and Vozio with the knowledge graph.
For any additional questions, or if you would like a more detailed overview of the mining tools and the knowledge graph, please contact us; one of our engineers will giddily show off our work!