Extracting Requirements From Existing Data Sources With a Parser

Extracting requirements from existing data sources has 3 aspects:

  • The methodology;
  • The tools;
  • The people.

The methodology

The approach usually consists of 5 steps:

  1. Identification;
  2. Formalization;
  3. Standardization;
  4. Generalization;
  5. Automation.

We discuss each of them in the following paragraphs.

Identification

The first step is to identify exactly what is wanted. This is a very difficult step, perhaps the most difficult of all. The result will define what is to be done, and what is not. It must include test specifications, so that we can objectively check that the result of this project is what we were all looking for. Two tools may be used to provide discussion material:

  1. Print part of the input data and provide the user with three marker pens in different colors: RED for must be handled, GREEN for nice to have, and BLUE for later.
  2. Run the report through a cross-referencer and study the outcome carefully. This gives you an overview of all the different words and character groupings that occur in the original data.
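
A minimal sketch of such a cross-referencer, written in C (hypothetical, for illustration only): it counts how often each word occurs in a report read from standard input. A real cross-referencer would also record the line numbers on which each word appears.

    #include <stdio.h>
    #include <string.h>

    #define MAXWORDS 10000

    static char *words[MAXWORDS];   /* each distinct word seen so far */
    static int  counts[MAXWORDS];   /* how often that word occurred  */
    static int  nwords;

    int main(void)
    {
        char buf[256];
        int i;

        while (scanf("%255s", buf) == 1) {
            /* linear search; fine for a one-off analysis of a report */
            for (i = 0; i < nwords && strcmp(words[i], buf) != 0; i++)
                ;
            if (i == nwords && nwords < MAXWORDS) {
                words[nwords] = strdup(buf);   /* first occurrence (strdup is POSIX) */
                counts[nwords++] = 0;
            }
            if (i < nwords)
                counts[i]++;                   /* count this occurrence */
        }
        for (i = 0; i < nwords; i++)
            printf("%6d  %s\n", counts[i], words[i]);
        return 0;
    }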

After this step we should know which results really make a difference, and also why we undertake this project.

Formalization

Create complete sentences in natural language, each describing an object and an action.

Standardization

Turn the sentences into a standard form: present tense, active voice, and singular.

Generalization

Where it is easily possible, rephrase the sentences by combining two or more of them into one new sentence covering both topics. This makes the specifications more abstract and less dependent on the here and now; they become more robust.

Automation

Transform the sentences into the form required by the chosen tool, and define test cases.
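
As an illustration of such test cases, here is a minimal sketch in C. The helper extract_amount() and the field format are hypothetical, purely for illustration: each requirement sentence gets at least one test with known input and expected output, so the result can be checked objectively.

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* hypothetical helper under test: return the amount following an
     * "EUR " prefix, or -1.0 when the field has no currency code */
    static double extract_amount(const char *field)
    {
        if (strncmp(field, "EUR ", 4) != 0)
            return -1.0;
        return atof(field + 4);
    }

    int main(void)
    {
        /* requirement: "The system extracts the amount from an EUR field." */
        assert(extract_amount("EUR 12.50") == 12.50);
        /* requirement: "The system rejects a field without a currency code." */
        assert(extract_amount("12.50") == -1.0);
        printf("all tests passed\n");
        return 0;
    }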

The tools

There are at least 3 types of tools you could use to extract data from an existing source:

  • A lexical analyzer/parser;
  • A clustering program;
  • A fuzzy logic program.

When to use which tool

During the life cycle of an application or system, the requirements for approach, tools, and personnel change. The choice also depends on the habits and culture of the company. In addition, it is very important to make maximal use of the input of the business user. To be able to do that you must know the user's background, interests, and the time available.

When to use a parser (opportunity → approach):

  • Throwaway software for one-time use → regular expressions (sed, awk, perl) (a sketch follows after this list);
  • Prototype or pathfinder for a usability check → flex/lex, accent, or bison/yacc;
  • Formal release → any approach you are comfortable with (3rd or 4th generation language, tools);
  • Bootstrapping or stepwise refinement → flex/lex, accent, or bison/yacc;
  • Measuring the trustworthiness of the original application → flex/lex, accent, or bison/yacc;
  • Relatively small amount of data, with nontransparent and probably changing rules → fuzzy solutions;
  • Relatively more discrete data, with nontransparent but constant rules → clustering solutions.
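
As a sketch of the first, throwaway case, the following hypothetical C program uses POSIX regular expressions to pull amounts such as 1234.56 out of an old report read from standard input; a sed, awk, or perl one-liner would do the same job.

    #include <stdio.h>
    #include <regex.h>

    int main(void)
    {
        regex_t re;
        regmatch_t m;
        char line[1024];

        /* an amount: one or more digits, a decimal point, two digits */
        if (regcomp(&re, "[0-9]+\\.[0-9][0-9]", REG_EXTENDED) != 0)
            return 1;
        while (fgets(line, sizeof line, stdin) != NULL)
            if (regexec(&re, line, 1, &m, 0) == 0)
                printf("%.*s\n", (int)(m.rm_eo - m.rm_so), line + m.rm_so);
        regfree(&re);
        return 0;
    }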

You choose a parser solution because:

  • You want to extract information from a large collection of data, and you can specify explicit rules;
  • You want to transform this data into another more appropriate form according to specific rules.

Examples are extracting information, restructuring data, or adding extra checks to the data. Suppose you have a massive file that is the original output of a very old system. Instead of modifying the report in the original system, you could extract the information you need. Or suppose you want to add totals or subtotals, or a graphical representation. By extracting and transforming, you can reuse the data without much trouble. To get some indication of the stability of the original data, you could calculate the totals with a second, independent algorithm, as in double-entry accounting. Of course both totals should be equal; the difference gives you an idea of the trustworthiness of the data.
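
A minimal sketch of this trustworthiness check in C, with hypothetical data: the grand total printed in the old report is compared with a total recomputed from the extracted detail lines. A difference signals that the extraction, or the original report, cannot be trusted blindly.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* detail amounts extracted from the body of the old report */
        double detail[] = { 100.00, 250.50, 74.25, 19.99 };
        /* grand total extracted from the footer of the same report */
        double reported_total = 444.74;
        double recomputed = 0.0;
        size_t i;

        for (i = 0; i < sizeof detail / sizeof detail[0]; i++)
            recomputed += detail[i];

        printf("reported %.2f, recomputed %.2f\n", reported_total, recomputed);
        if (fabs(reported_total - recomputed) > 0.005)
            printf("warning: totals differ, the data may not be trustworthy\n");
        return 0;
    }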

Sources of data could be an old (batch) report, an email archive, or web pages.

A parsing solution usually has 3 steps (a small sketch in C follows after this list):

  1. Lexical analysis;
  2. Syntax analysis;
  3. Semantic analysis.
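
A minimal sketch of the three steps on a hypothetical input format where every line has the shape NAME = NUMBER: the lexical step turns characters into tokens, the syntax step checks the token sequence, and the semantic step adds the recognized numbers up.

    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>

    enum token { T_NAME, T_EQUALS, T_NUMBER, T_END, T_ERROR };

    static const char *p;     /* current position in the line */
    static double value;      /* value of the most recent T_NUMBER token */

    /* lexical analysis: return the next token on the current line */
    static enum token next_token(void)
    {
        while (*p == ' ' || *p == '\t')
            p++;
        if (*p == '\0' || *p == '\n')
            return T_END;
        if (*p == '=') {
            p++;
            return T_EQUALS;
        }
        if (isdigit((unsigned char)*p)) {
            char *end;
            value = strtod(p, &end);
            p = end;
            return T_NUMBER;
        }
        if (isalpha((unsigned char)*p)) {
            while (isalpha((unsigned char)*p))
                p++;
            return T_NAME;
        }
        return T_ERROR;
    }

    int main(void)
    {
        char line[256];
        double total = 0.0;

        while (fgets(line, sizeof line, stdin) != NULL) {
            p = line;
            /* syntax analysis: the line must have the shape NAME '=' NUMBER */
            if (next_token() == T_NAME && next_token() == T_EQUALS &&
                next_token() == T_NUMBER && next_token() == T_END)
                total += value;            /* semantic analysis: use the value */
            else
                fprintf(stderr, "rejected: %s", line);
        }
        printf("total = %.2f\n", total);
        return 0;
    }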

Types of parsers suitable for this task, and with an acceptable learning curve for incidental use:

  • Context-free (CF) (type 2);
  • Regular (type 3).

Regular parsers are actually lexical analyzers and can be handled as such. Context-free parsers come in two types:

  • General;
  • Linear.

When to use a general parser:

  • Grammar changes over time;
  • Grammar is ambiguous.

Examples of general parsers are:

  • Unger;
  • Earley.

Properties of Unger (a minimal sketch follows after this list):

  • Easy to program;
  • Easy to fine tune;
  • Grammar is easy to understand.
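
To illustrate that an Unger parser is indeed easy to program, here is a minimal sketch in C for a hypothetical toy grammar (E -> E+T | T, T -> T*a | a). It tries all partitions of the input over the right-hand side of each rule, which is the essence of Unger's method; since the grammar has no epsilon rules, every part of a partition must be nonempty.

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    struct rule { char lhs; const char *rhs; };

    /* toy grammar: uppercase letters are nonterminals, the rest terminals */
    static const struct rule grammar[] = {
        { 'E', "E+T" }, { 'E', "T" },
        { 'T', "T*a" }, { 'T', "a" },
    };
    static const int nrules = sizeof grammar / sizeof grammar[0];

    static int derives(char sym, const char *s, int lo, int hi);

    /* match the remaining rhs symbols against s[lo..hi) by trying
     * all partitions into nonempty parts (the core of Unger's method) */
    static int match_rhs(const char *rhs, const char *s, int lo, int hi)
    {
        int k;
        if (*rhs == '\0')
            return lo == hi;
        if (rhs[1] == '\0')                    /* last symbol takes the rest */
            return derives(rhs[0], s, lo, hi);
        for (k = lo + 1; k <= hi - (int)strlen(rhs + 1); k++)
            if (derives(rhs[0], s, lo, k) && match_rhs(rhs + 1, s, k, hi))
                return 1;
        return 0;
    }

    /* does symbol sym derive s[lo..hi)? terminals must match exactly */
    static int derives(char sym, const char *s, int lo, int hi)
    {
        int r;
        if (!isupper((unsigned char)sym))
            return hi - lo == 1 && s[lo] == sym;
        for (r = 0; r < nrules; r++)
            if (grammar[r].lhs == sym && match_rhs(grammar[r].rhs, s, lo, hi))
                return 1;
        return 0;
    }

    int main(void)
    {
        const char *samples[] = { "a", "a+a*a", "a+*a" };
        int i;
        for (i = 0; i < 3; i++)
            printf("%-8s %s\n", samples[i],
                   derives('E', samples[i], 0, (int)strlen(samples[i]))
                       ? "accepted" : "rejected");
        return 0;
    }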

With a general parser, semantic actions are always executed afterwards, once the parser has finished.

When to use a linear parser:

  • Stable grammar;
  • It is acceptable that the grammar may be slightly altered from the original.

The main linear parsers are:

  • LL(1);
  • LALR(1).

Properties of LL(1) (a minimal sketch follows after this list):

  • Semantic actions can be executed during parsing, even before a construct has been completely recognized;
  • Usually requires more extensive changes to the original grammar than LALR(1) does;
  • Easier to understand and modify (for all project participants);
  • Memory usage and speed are comparable with LALR(1).
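
A minimal sketch of an LL(1) parser in C, written as a hand-coded recursive descent parser for a hypothetical expression grammar over single digits; note that the semantic actions (the evaluation) run while parsing is still in progress.

    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>

    static const char *p;                 /* current input position */

    static void fail(const char *msg)
    {
        fprintf(stderr, "error: %s at '%s'\n", msg, p);
        exit(1);
    }

    static long expr(void);

    /* factor -> digit | '(' expr ')' */
    static long factor(void)
    {
        if (isdigit((unsigned char)*p))
            return *p++ - '0';
        if (*p == '(') {
            long v;
            p++;
            v = expr();
            if (*p != ')')
                fail("expected ')'");
            p++;
            return v;
        }
        fail("expected digit or '('");
        return 0;
    }

    /* term -> factor { '*' factor } */
    static long term(void)
    {
        long v = factor();
        while (*p == '*') { p++; v *= factor(); }   /* semantic action during parsing */
        return v;
    }

    /* expr -> term { '+' term } */
    static long expr(void)
    {
        long v = term();
        while (*p == '+') { p++; v += term(); }
        return v;
    }

    int main(void)
    {
        p = "2+3*(4+1)";
        printf("2+3*(4+1) = %ld\n", expr());
        return 0;
    }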

Examples of C-generating parser tools:

  Parser methodology and tools
  Methodology   Tool
  Unger         -
  Earley        Accent
  Tomita        -
  LL(1)         ANTLR
  LALR(1)       YACC, Bison, BYACC

The people

For re-engineering and extracting requirements from data, you need an experienced business analyst. This person needs extensive experience in:

  • The business domain;
  • Actively collecting business user input;
  • Reducing specifications to a meta level.
