How to smoothly transition from Business Intelligence to Advanced Analytics.

Advanced Analytics has its roots in Business Intelligence and goes beyond it, towards Artificial Intelligence. Business Intelligence is typically realized through common data-driven facilities for querying, reporting and online analytical processing; it shows "what happened", "when it happened" and how many or which items and events are connected to the target of a request. Advanced Analytics, by contrast, is based on mathematical predictive modeling, statistics, hundreds of domain rules and clustering. The core of advanced analysis is text processing. It plays the key role in relation mining, building semantic or decision trees, discovering sentiments, ontologies and collections of items for clustering, and dividing sentences into subtrees. In other words, text processing analyzes every single token and classifies it as an object "in the context of something", while Business Intelligence algorithms are usually only able to identify an object as a single entity.

This is where multiple subtleties and mishaps appear in business process and project design. When evolving an enterprise system, it seems like a good idea to introduce some intelligence. This is usually done with linear programming and heuristic methods. Such an approach results in data migrating to one of the existing Big Data ecosystems, which allows building data clusters with near real-time response. Later on, developers, data scientists or domain "linguists" do a lot of specific coding, covering particular cases for the business process and domain entities. This brings in functions such as keyword search and algorithms such as clustering and machine learning, all of which rely on keywords. And alas, this is where the complications begin. If your business value is a phone or some other well-defined kind of goods, it may work well. However, if you need to make a prediction, or would like to receive reports produced by clustering algorithms, it may not be enough. A thin phone is a matter of high-end innovation, but a thin network channel may be the result of a lack of any innovation, and a "thin client" may be just a neutral term used in software design. All of the above means that the transition from BI to AA requires analyzing the whole context, starting with the domain area and document topic and covering every single sub-tree of paragraphs, sentences, SVO triples, and eventually concepts and words. To cut a long story short, you cannot find a sentiment in a text without analyzing the whole sentence and sometimes even the type of the source.
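To make the "thin" example concrete, here is a minimal sketch using spaCy as a stand-in for a context-aware parser (it is not the Linguistic Processor discussed later, and the three sentences are invented): a dependency parse tells you which noun the adjective actually modifies, which is exactly the context a plain keyword index throws away.

```python
# A sketch with spaCy (not the Linguistic Processor discussed below); the
# three sentences are invented. The dependency parse shows which noun "thin"
# modifies, which is the context a keyword index loses.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

texts = [
    "The new thin phone impressed reviewers.",
    "A thin network channel slowed the rollout.",
    "The thin client connects to the terminal server.",
]

for text in texts:
    doc = nlp(text)
    for token in doc:
        if token.lower_ == "thin":
            # token.head is the word the adjective attaches to in the parse.
            print(f"'thin' modifies '{token.head.text}' in: {text}")
```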

This is the stage at which machine learning comes into play. Splitting text into sentences and eventually into SVO triples is not always possible. Grammar rules work well for literary works and well-formatted documents, but in articles, blogs, social network posts and even news bulletins things are different. When analyzing such an incoming text source, you have to figure out the language, code page, topic, document format, paragraphs and so on. Natural language identification can be carried out with a Markov model built on statistics, and creating those statistics requires an enormous amount of work by linguists; after all, there is an array of data for every more or less widespread language. Part-of-speech tagging is a much more complicated task that requires a long range of datasets, including documents and articles, as well as a toolset to operate and monitor all of them. Hence the solution: a lightweight database with some know-how based on the results of the machine learning process, combined with dictionaries.
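As a toy illustration of the Markov-model idea (the one-line "corpora" below are placeholders; real statistics are built from large corpora prepared by linguists), a character-bigram model per language can already score a sample text:

```python
# A toy Markov-style language identifier: score a sample by how likely its
# character bigrams are under per-language bigram statistics.
from collections import Counter
import math

def bigrams(text):
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def log_score(sample, model, total):
    # Add-one smoothing so unseen bigrams do not zero out the score.
    return sum(n * math.log((model[bg] + 1) / (total + len(model)))
               for bg, n in bigrams(sample).items())

corpora = {
    "en": "the quick brown fox jumps over the lazy dog and then runs away",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und rennt weg",
}
models = {lang: bigrams(text) for lang, text in corpora.items()}
totals = {lang: sum(m.values()) for lang, m in models.items()}

sample = "the report shows what happened and when"
best = max(models, key=lambda lang: log_score(sample, models[lang], totals[lang]))
print(best)  # expected: "en"
```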

Say you need to create an enterprise data store and divide its content into sections on business value and private activities. It would be a mostly unstructured database, in which emails, charts and corporate documents have to be available to any kind of request. Without an efficient Advanced Analytics solution, one has to set up a scalable cluster system such as Spark (with the relevant tools) and fill it with data. Afterwards, developers can start coding all the specifics for the relevant requests. The use cases that have to be covered by this coding are often outside the sphere of interest of software developers, which is why it may be hard for the whole team to understand how to arrange data in an unstructured database. Any change of value may require a lot of coding aimed at the core of the whole system, and integrating it with the existing corporate system can add some pain. All of this is quite expensive, so a heuristic approach is recommended for introducing Advanced Analytics into the business ecosystem. One of the challenges in such projects is connecting existing software that has been developed for more than ten years in C++ and RDBMS with modern Java APIs that operate on streams and special scale-out entities. Moreover, dividing algorithms into "data scientist" tasks and "developer" tasks is an art and a challenge. That is why using solutions like a Linguistic Processor may show the team the light at the end of the tunnel: it separates the data scientist's work environment from the linguist's and allows software developers to peacefully engage in the art of creating the software. Using it, a data scientist can create a dataset and perform machine learning separately, involving a linguist or a developer only when necessary. Filling up a database store without preprocessing by the Linguistic Processor leaves many specifically linguistic tasks to software developers and may require data scientists to have skills in multiple programming languages.

Out of the box, a Linguistic Processor provides tokenization and segmentation, sentence splitting, part-of-speech tagging, lemmatization, named entity extraction, and relation extraction (SVO, VO, "have to", "is", "part of", etc.).
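The Linguistic Processor itself is not a public library, so as a rough stand-in, here is what those out-of-the-box functions look like with spaCy; the SVO extraction is a deliberately naive walk over the dependency parse:

```python
# A rough stand-in (spaCy, not the Linguistic Processor): sentence splitting,
# tokenization, POS tagging, lemmas, named entities, and naive SVO triples.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("IBM developed Nominator. The tool extracts names and locations.")

for sent in doc.sents:                           # sentence splitting
    for tok in sent:                             # tokenization
        print(tok.text, tok.pos_, tok.lemma_)    # POS tags and lemmas

print([(ent.text, ent.label_) for ent in doc.ents])  # named entities

# Naive SVO triples: a nominal subject and a direct object attached to a verb.
for tok in doc:
    if tok.pos_ == "VERB":
        subj = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
        obj = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
        if subj and obj:
            print((subj[0].text, tok.lemma_, obj[0].text))
```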

The ecosystem around the Linguistic Processor allows you to do natural language identification, semantic clustering, semantic categorization, semantic document comparison, text summarization, related fact extraction, spellchecking and sentiment processing.
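The article does not spell out how the ecosystem implements these functions; as one illustrative baseline for semantic document comparison, word-vector similarity (here via spaCy's medium English model, which ships word vectors) already goes beyond shared keywords:

```python
# An illustrative baseline for semantic document comparison, not the article's
# ecosystem: vector similarity with spaCy's medium English model (assumed
# installed; the small model has no word vectors).
import spacy

nlp = spacy.load("en_core_web_md")

a = nlp("The board approved the quarterly financial report.")
b = nlp("Directors signed off on the Q3 earnings statement.")
c = nlp("The hiking trail was muddy after the rain.")

print(a.similarity(b))  # semantically close despite few shared keywords
print(a.similarity(c))  # semantically distant
```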

The Linguistic Processor environment includes:

- a rule-based approach, so that a linguist can extend and improve the LP's functionality without programming skills;

- a dictionary system that can be easily edited and extended;

- facilities for introducing new languages;

- a scalable and friendly data learning store with a mining API for use in algorithms;

- a friendly machine learning toolset that a data scientist can use at any time without special skills.

“Semantic” means human-like understanding. These are functions that cannot be easily built with keyword-processing tools such as Lucene. Well, they actually can be built, but they would not be semantic.

So it appears that building analytics on keywords has serious disadvantages. The target solution has to consider every single token as a target for any request, which can overload the whole system on a generic request. In layman’s terms, any keyword-based solution works as follows: the more single words there are in a request, the more references there will be in the response. Conversely, a semantic solution returns more specific results for a more specific request.
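A toy sketch of that contrast, with invented documents and triples: OR-style keyword matching only grows (or stays flat) as the query grows, while matching on an extracted (subject, verb, object) relation narrows the answer.

```python
# Toy contrast between keyword and relation matching; documents and triples
# are made up for illustration.
docs = {
    1: {"words": {"thin", "client", "server"},   "svo": {("client", "connect", "server")}},
    2: {"words": {"thin", "phone", "display"},   "svo": {("vendor", "release", "phone")}},
    3: {"words": {"thin", "channel", "network"}, "svo": {("channel", "limit", "throughput")}},
}

def keyword_hits(query_words):
    # OR semantics: any shared word counts as a hit.
    return [d for d, v in docs.items() if v["words"] & query_words]

def relation_hits(triple):
    return [d for d, v in docs.items() if triple in v["svo"]]

print(keyword_hits({"thin"}))                          # [1, 2, 3]
print(keyword_hits({"thin", "phone"}))                 # still [1, 2, 3]: broader, not sharper
print(relation_hits(("client", "connect", "server")))  # [1]: one precise answer
```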

There are similar advantages for clustering algorithms. The Linguistic Processor can easily discover meaningful relations between clusters, or, in other words, calculate their linguistic differences and similarities.
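As a rough baseline (again, not the Linguistic Processor itself), TF-IDF plus k-means with the top terms per cluster gives a crude picture of the "linguistic differences" between clusters:

```python
# A crude baseline for cluster differences: TF-IDF + k-means on a few
# made-up snippets, printing the top terms of each cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

snippets = [
    "quarterly revenue grew and profit margins improved",
    "the board approved the annual financial report",
    "the new phone has a thin display and fast processor",
    "the vendor released a thin client for terminal servers",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(snippets)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for c in range(km.n_clusters):
    top = km.cluster_centers_[c].argsort()[::-1][:3]
    print(c, [terms[i] for i in top])
```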

Any Linguistic Processor needs ad-hoc toolsets to convert and process documentation. Nowadays most users need to work with web pages, standard Word and PDF documents, various mail formats, archives, etc. Preliminary text extraction and preparation can be a very tricky task when customers have a document store with a long history: extraction may have to deal with documents in outdated formats or with PDF files containing a large amount of printed text on images. Introducing or eliminating OCR in the solution may significantly affect recall and precision on specific datasets.
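A minimal extraction sketch, assuming the tika Python bindings to Apache Tika (which require a Java runtime) and a hypothetical file path; documents that come back nearly empty are the candidates for an OCR pass:

```python
# Minimal extraction sketch using the `tika` Python bindings to Apache Tika;
# the file path is a placeholder. Files with little or no extractable text
# are routed to OCR instead.
from tika import parser

parsed = parser.from_file("archive/old_contract.pdf")  # hypothetical document
text = (parsed.get("content") or "").strip()

if len(text) < 100:
    print("no extractable text; OCR needed")
else:
    print(text[:200])
```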

As you can see, any search or index-based service needs a full-fledged NLP core, a.k.a. a Linguistic Processor, that makes profound semantic analysis possible. Several teams are dedicated to developing Linguistic Processors and related software:

- Stanford CoreNLP – a Java-based solution, widely used in many Java data mining stacks. It is mostly a standalone command-line tool and a set of libraries that represent a complete linguistic processing toolset intended for generic processing. No native classification/clustering tools are included.

- OpenNLP – an Apache-stack, Java-based solution. It is a standard Linguistic Processor with basic semantic text extraction facilities.

- Nominator (IBM) – a text processing solution focused mostly on extracting names, locations and places. It is developed in C, which makes it a robust and portable solution. It takes tokenized input.

- WordNet – a large lexical database where words are grouped into verbs, nouns, adjectives and higher-level entity classes. It represents a form of context structuring and relation modeling. It looks very attractive for tasks that do not require any programming skills.

- LinkIT (Columbia University) – a linguistic tool for extracting simplex noun phrases. No named entities, no generic sentence parsing or grammar-based tree creation, no sentiments. It is a more specialized tool that may serve as a core part of a Linguistic Processor stack.

Which one would you choose? It is hard to say, but please bear in mind the following:

A Java-based application or library is not meant to be a self-sufficient generic desktop application or a single-box solution. Usually it is part of an application stack designed to work in a dedicated environment. Both OpenNLP and CoreNLP are perfectly suitable for a cloud-based solution or a server-side back end, and they require rather powerful server hardware. They are developed as standalone generic Java tools and are supposed to be part of the core of the target service. For building a clustering/document classification system, you need additional libraries such as Carrot2, Apache Mahout, MALLET or Weka, together with text-processing frameworks.

WordNet is just a text database, as well as an online lexical reference system, that may be useful for various cognitive tasks: it lets you query existing relations, word groups and classes. It is easy to query from common command-line tools or scripting languages and to tag some data with it without any third-party tools or Linguistic Processors.
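For example, querying WordNet from a scripting language takes only a few lines; this sketch assumes NLTK is installed and the wordnet corpus has been downloaded:

```python
# Querying WordNet via NLTK (assumes nltk is installed and the corpus was
# fetched with nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

for syn in wn.synsets("client"):
    print(syn.name(), "-", syn.definition())

# Hypernyms give the higher-level classes a word belongs to.
print(wn.synsets("phone")[0].hypernyms())
```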

Integrating any of the above-mentioned technologies has its own specifics and peculiarities. A C/C++ library can be easily integrated into any system, whether it is a Java or a .NET solution, and a standalone desktop application is naturally based on C++ libraries. CoreNLP requires about 2 GB of RAM by default even for minimal processing, so it generally cannot work in a 32-bit environment, where a process is limited to roughly 2 GB of addressable memory.

Which solution to use depends on the availability of the final range of products as well as on how the whole service will evolve. It is something you need to investigate very thoroughly in order to ensure the best possible results.


