Match algorithms capable of handling large amounts of dat% without giving up expressiveness are a key requirement for successful integration of relational database systems and powerful rule-based systems. Algorithms that have been used for database rule systems have usually been unable to support large and complex rule sets, while the algorithms that have been used for rule-based expert systems do not scale welt with data. Furthermore% these algorithms do not ~ovide support for collection (or set) oriented production languages. This paper ~oposes a basic shift in the nature of match algorithms: from tuple-oriented to collectwn-oriented. A collection-oriented match algorithm matches each condition in a production with a collection of tuples and generatea collection-oriented instarrtiutwns, i.e., instantiation that have collection of tuples corresponding to each condition. This approach shows great promise for efllciently matching expressive productions against large amounts of data. In addition, it provides direct support for collection-oriented production languages. We have found that many existing tuple-oriented match algorithms can be easily rmnsfonned to their collection-oriented analogues. This paper presents the transformation of Rete to Collection Rete as an example and compares the two based on a set of benchmarks. Results presented in this paper show tha~ for large amounts of data, a relatively underoptitnized implementation of Collection Rete achieves orders of magnitude improvement in time and space over an optimized version of Rete. The results establish the feasibility of collection-oriented match for integrated database-production systems.
Knowledge-refinement is a central problem in the field of expert systems[Buchanan and Shortliffe,1984]. It refers to the progressive refinement of the initial knowledge-base of an expert systeminto a high-performance knowledge-base. For rule-based systems, refinement implies the addition,deletion and modification of rules in the system so as to improve the system’sempirical adequacy,i.e., its ability to reach correct conclusions in the problems it is intended to solve[Ginsberget al.,1988].The goal of our research effort is to understand the methodology for refining large rule-basedsystems, as well as to develop tools that will be useful in refining such systems. The vehicle forour investigation isspam, a production system (rule-based system) for the interpretation of aerialimagery[McKeownet al., 1985, McKeownet al., 1989]. It is a mature research system havingover 600 productions, many of which interact with a variety of complex (non-rule-based) geometricalgorithms. A typical scene analysis task requires between50,000 to 400,000 production firings andan execution time of the order of 2 to 4 cpu hours1.Large, compute-intensive systems likespamimpose some unique constraints on knowledge refine-ment. First, the problem of credit/blame-assignment is complicated; it is extremely difficult toisolate a single culprit production (or a set of culprit productions) to blame for an error observed inthe output. As a result, the methodology adopted in well-known systems such asseekandseek2[Politakis and Weiss, 1984, Ginsberget al., 1988], orkrust[Craw and Sleeman, 1991], cannotbe directly employed to refine knowledge inspam. These systems identify errorful output andbackward-chain through the executed rules to localize the source of the error. A second problemis thatspam’s long run-times make it difficult to rely on extensive experimentation for knowledgerefinement. Iterative refinement and generate and test methods typically require many experimentsto improve the knowledge base. In particular,spam’s long run-times prohibit a thorough search ofthe space of possible refinements.Givenspam’s constraints, we have employed a bottom-up approach for knowledge refinement.Specifically, we begin by refining small portions ofspam, and then attempt to understand the inter-actions of these refinements and their impact on intermediate results. Fortunately,spamis alreadydivided into four phases, facilitating the identification of the ”modular” pieces on which we focusour refinement efforts, as well as our evaluation efforts (discussed below). We begin by identifyinggaps and/or faults within small portions ofspam’s individual phases by comparing their output tothat of an expert. The knowledge is modified to more accurately match the expert’s output. Wethen evaluate the new output to see how well the refined knowledge is performing. This methodchains forward from the refined knowledge to an evaluation ofit’s performance. This is in contrastto the backward-chaining systems cited above, which identify errorful output and reason backwardto find the faulty knowledge (i.e., assign credit/blame). Backward chaining is difficult when theinteractions between rules are complex. In particular, inspam, this implies backward-chaininghundreds of thousands of rule firings as well as applicationsof complex algorithmic constraints,which would be extremely difficult.However, forward-chaining does not obviate the credit/blame assignment problem. It introducesthe problem in a new form — that of evaluating the impact of theknowledge refinements. Inparticular, in forward-chaining systems, given complex rule interactions, refinement in one partof the system need not improve the end result. This may occur because the refined portion maycontribute to overall improvement in a relatively small number of cases; or a second culprit compo-nent involved in the chaining may actually minimize the impact of the refinements. Hence, there is a need for evaluation and analysis ofintermediateresults. Of course, these intermediate resultsshould more than exactly evaluate the parts of the system refined as a result of interaction withthe expert — they should progressively yield a better understanding of how to change the rules tobetter the system’s overall performance.spam’s decomposition into phases helps us to a certainextent: the endpoints of the phases serve as potential intermediate evaluation points.In our work so far, we have focused on the second phase inspamlocal-consistency (lcc). Thisphase was chosen because most ofspam’s time is spent in this phase, and because it shows the mostpotential for future growth.lccperforms a modified constraint satisfaction between hypothesesgenerated inspam’s first phase. It applies constraints to a set of plausible hypotheses and prunesthe hypotheses that are inconsistent with those constraints. For example, objects hypothesized ashangar-buildings and objects hypothesized as parking-aprons would participate in the constrainthangar-buildings and parking-aprons are close together. This rule is realized as a distance con-straint on pairs of objects hypothesized to be hangar-buildings and parking-aprons. A successfulapplication of anlccconstraint provides support for each pair of hypotheses, and an unsuccessfulapplication decreases support for that pair of hypotheses.In essence, each constraint in thelccphase classifies pairs of hypotheses — either the constraintsupports that pair, or it does not. (lccandspamare described in further detail in Section 2.)In working toward refining these constraints, we posed several questions:1. What role does this constraint play in the interpretation process? 2. If the constraint does play a role, is it positive (helpful) or negative (unhelpful)?3. If the role is positive, and the constraint is numeric, can we optimize the constraint values?4. How effective (powerful) is this constant? 5. What is this constraint’s impact on run time?To address these questions, we began by asking a user to manually compile a database of correct intermediate outputs — the ground-truth database. The database and the system outputs were compared. An automatic procedure was created to adjust the spam knowledge base so that the correspondence between the ground-truth database and the knowledge base improved. This adjusted knowledge was then re-run through the system and the entire set of intermediate outputs(refined or otherwise) were evaluated. The somewhat surprising result of this analysis was that the expert’s input did not help as much as expected. This result raised questions about why the expert’s input was not as helpful as anticipated, and how the results should be evaluated.In order to better understand both the interactions between constraints and the effects of modifying the knowledge base, we did a brute-force evaluation of the results, providing some answers to the questions we posed above. Namely, the distance constraints do play positive roles in spam’s interpretation process, though they are not very powerful.More importantly, we now see that re-fined knowledge may apply selectively. That is, the set of objects affected by individual constraints largely overlaps, implying that spam could reduce computation by selectively applying constraints and still achieve the similar performance. Tools eliciting the expert’s input must carefully take these interactions into consideration. Finally, we can also conclude that intermediate result evaluation is not straight-forward. For example, the complex structures that spam generates using the results of constraint application, called functional-areas, must be matched and these matches evaluated.Having provided some of the motivation for this work, we begin by describing the spam image interpretation system system. Following this background material, we present our refinement methodology and an analysis of our results. We then re-evaluate the methodology and discuss further experimental results. Finally, we will outline some areas of future work.
Machine learning approaches to knowledge compilation seek to improve the perfonnance of problem-solvers by storing solutions to previously solved problems in an efficient, generalized fonn. The problem-solver retrieves these learned solutions in appropriate later situations to obtain results more efficiently. However, by relying on its learned knowledge to provide a solution, the problem-solver may miss an alternative solution of higher quality - one that could have been generated using the original (non-learned) problem-solving knowledge. This phenomenon is referred to as the ITUlSking effect of learning.
In this paper, we examine a sequence of possible solutions for the masking effect. Each solution refines and builds on the previous one. The fmal solution is based on cascaded filters. When learned knowledge is retrieved, these filters alert the system about the inappropriateness of this knowledge so that the system can then derive a better alternative solution. We analyze conditions under which this solution will perfonn better than the others, and present experimental data supportivt: of the analysis. This investigation is based on a simulated robot domain called Groundworld.