Mach Learn (2006) 62:65–105

DOI 10.1007/s10994-006-6064-1

Distribution-based aggregation for relational learning

with identifier attributes

Claudia Perlich · Foster Provost

Received: 11 August 2004 / Revised: 22 February 2005 / Accepted: 05 July 2005

Published online: 27 January 2006

© Springer Science + Business Media, Inc. 2006

Abstract  Identifier attributes (very high-dimensional categorical attributes such as particular product IDs or people's names) rarely are incorporated in statistical modeling. However, they can play an important role in relational modeling: it may be informative to have communicated with a particular set of people or to have purchased a particular set of products. A key limitation of existing relational modeling techniques is how they aggregate bags (multisets) of values from related entities. The aggregations used by existing methods are simple summaries of the distributions of features of related entities: e.g., MEAN, MODE, SUM, or COUNT. This paper's main contribution is the introduction of aggregation operators that capture more information about the value distributions, by storing meta-data about value distributions and referencing this meta-data when aggregating, for example by computing class-conditional distributional distances. Such aggregations are particularly important for aggregating values from high-dimensional categorical attributes, for which the simple aggregates provide little information. In the first half of the paper we provide general guidelines for designing aggregation operators, introduce the new aggregators in the context of the relational learning system ACORA (Automated Construction of Relational Attributes), and provide theoretical justification. We also conjecture special properties of identifier attributes, e.g., they proxy for unobserved attributes and for information deeper in the relationship network. In the second half of the paper we provide extensive empirical evidence that the distribution-based aggregators indeed do facilitate modeling with high-dimensional categorical attributes, and in support of the aforementioned conjectures.

Keywords: identifiers · relational learning · networks

Editors: Hendrik Blockeel, David Jensen and Stefan Kramer

C. Perlich (✉)

IBM T.J. Watson Research Center

F. Provost

New York University

1. Introduction

Predictive modeling often is faced with data including important relationships between entities. For example, customers engage in transactions which involve products; suspicious people may make phone calls to the same numbers as other suspicious people. Extending the traditional "propositional" modeling approaches to account for such relationships introduces a variety of opportunities and challenges. The focus of this paper is one such challenge: the integration of information from one-to-many and many-to-many relationships. A customer may have purchased many products; a person may have called many numbers.

Such n-to-many relationships associate with any particular entity a bag (multiset) of related entities. Since the ultimate objective of much predictive modeling is to estimate a single value for a particular quantity of interest, the predictive model must either ignore the bags of related entities or aggregate information from them.

The aggregation operators used by existing relational modeling approaches typically are simple summaries of the distributions of features of related entities, e.g., MEAN, MODE, SUM, or COUNT. These operators may be adequate for some features, but fail miserably for others. In particular, if the bag consists of values from high-dimensional categorical attributes, simple aggregates provide little information. Object identifiers are one instance of high-dimensional categorical attributes, and they are abundant in relational domains since they are necessary to express the relationships between objects. Traditional propositional modeling rarely incorporates object identifiers, because they typically hinder generalization (for example by creating "lookup tables"). However, the identities of related entities can play an important role in relational modeling: it may be informative to have communicated with a specific set of people or to have purchased a specific set of products. For example, Fawcett and Provost (1997) show that incorporating particular called-numbers, location identifiers, merchant identifiers, etc., can be quite useful for fraud detection.

Consider the following example of a simple relational domain that exhibits such n-to-many relationships. The domain consists of two tables in a multi-relational database: a target table, which contains one row for each of a set of target entities, about which some attribute value will be estimated, and an auxiliary table that contains multiple rows of additional information about entities related to the target entities. Figure 1 illustrates the case of a customer table and a transaction table. This simple case is ubiquitous in business applications, such as customer classification for churn management, direct marketing, fraud detection, etc. In each, it is important to consider transaction information such as types, amounts, times, and locations. Traditionally practitioners have manually constructed features before applying a conventional propositional modeling technique such as logistic regression. This manual process is time consuming, becomes infeasible for large and complex domains, and rarely will provide novel and surprising insights.

Relational learning methods address the need for more automation and support of modeling in such domains, including the ability to explore information about the many-to-many relationship between customers and products. If the modeling objective is to estimate the likelihood of responding to an offer for a particular book, it may be valuable to incorporate the specific books previously bought by the customer, as captured by their ISBNs. The MODE clearly is not suitable to aggregate a bag of ISBNs, since typically books are bought only once by a particular customer. In addition, this MODE feature would have an extremely large number of possible values, perhaps far exceeding the number of training examples.

Fig. 1  Example of a relational classification task consisting of a target table Customer(CID, CLASS) and a one-to-many relationship to the table Transaction(CID, TYPE, ISBN, PRICE)

We introduce novel aggregators¹ that allow learning techniques to capture information from identifiers such as ISBNs. This ability is based on (1) the implicit reduction of the dimensionality by making (restrictive) assumptions about the number of distributions from which the values were generated, and (2) the use of distances to class-conditional, distributional meta-data. Such distances reduce the dimensionality of the model estimation problem while maintaining discriminability among instances, and they focus explicitly on discriminative information.

The contributions of this work include:

1. An analysis of principles for developing new aggregation operators (Section 2).

2. The development of a novel method for relational feature construction, based on the foregoing analysis, which includes novel aggregation operators (Section 3). To our knowledge, this is the first relational aggregation approach that can be applied generally to categorical attributes with high cardinality.

3. A theoretical justification of the approach that draws an analogy to the statistical distinction between random- and fixed-effect modeling, and identifies typical aggregation assumptions that limit the expressive power of relational models (Section 3.4).

4. A theoretical conjecture (Section 3.5) that the aggregation of identifier attributes can implicitly support the learning of models from unobserved object properties.

5. An extensive empirical study demonstrating that the novel aggregators indeed can improve predictive modeling in domains with important high-dimensional categorical attributes, including a sensitivity analysis of major domain properties (Section 4).

The proposed aggregation methodology can be applied to construct features from various attribute types and for a variety of modeling tasks. We will focus in this paper on high-dimensional categorical attributes, and on classification and the estimation of class-membership probabilities. Unless otherwise specified we will assume binary classification.

¹ This paper is an extension of the second half of a prior conference paper (Perlich & Provost, 2003).

2. Design principles for aggregation operators

Before we derive a new aggregation approach for categorical attributes with high cardinality, let us explore the objectives and some potential guidelines for the development of aggregation operators.² The objective of aggregation in relational modeling is to provide features that improve the generalization performance of the model (the ideal feature would discriminate perfectly between the cases of the two classes). However, feature construction through aggregation typically occurs in an early stage of modeling, or one far removed from the estimation of generalization performance (e.g., while following a chain of relations). In addition, aggregation almost always involves loss of information. Therefore an immediate concern is to limit the loss of predictive information, or the general loss of information if predictive information cannot yet be identified.

For instance, one measure of the amount of information loss is the number of aggregate values relative to the number of possible unique bags. For example, for the variable TYPE in our example there are fifty-four possible unique, non-empty bags with size less than ten containing values from {Fiction, Non-Fiction}. Consider two simple aggregation operators: MODE and COUNT. MODE has two possible aggregate values and COUNT has nine. Both lose considerable information about the content of the bags, and one might argue that the general information loss is larger in the case of MODE. In order to limit the loss and to preserve the ability to discriminate classes later in the process, it is desirable to preserve the ability to discriminate instances:
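The fifty-four-bag count can be verified with the standard multiset-counting formula; a quick sketch (the formula itself is standard combinatorics, not part of the paper's exposition):

```python
from math import comb

# The number of distinct bags (multisets) of size k drawn from n distinct
# values is C(n + k - 1, k).  For TYPE there are n = 2 values
# {Fiction, Non-Fiction}, and we count non-empty bags of size less than
# ten (k = 1..9).
n = 2
total_bags = sum(comb(n + k - 1, k) for k in range(1, 10))
print(total_bags)  # 54

# The two simple aggregates collapse these 54 bags onto far fewer values:
mode_values = n       # MODE: 'Fiction' or 'Non-Fiction'
count_values = 9      # COUNT: bag sizes 1..9
```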

Principle 1. Aggregations should capture information that discriminates instances.

Although instance discriminability is desirable, it is not sufficient for predictive modeling. It is simple to devise aggregators that involve no apparent information loss. For the prior example, consider the enumeration of all possible 54 bags, or a prime coding 'Non-Fiction' = 2, 'Fiction' = 3, where the aggregate value corresponding to a bag is the product of the primes. A coding approach can be used to express any one-to-many relationship in a simple feature-vector representation. An arbitrary coding would not be a good choice for predictive modeling, because it almost surely would obscure the natural similarity between bags: a bag with 5 'Fiction' and 4 'Non-Fiction' will be just as similar to a bag of 9 'Fiction' books as to a bag of 5 'Fiction' and 5 'Non-Fiction' books. In order for aggregation to produce useful features it must be aligned with the implicitly induced notion of similarity that the modeling procedure will (try to) take advantage of. In particular, capturing predictive information requires not just any similarity, but similarity with respect to the learning task given the (typically Euclidean) modeling space. For example, an ideal predictive numeric feature would have values with small absolute differences for target cases of the same class and values with large absolute differences for objects in different classes. This implies that the aggregates should not be independent of the modeling task; if the class labels were to change, the constructed features should change as well.
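A minimal sketch of such a prime coding. It illustrates that the encoding is lossless (by unique factorization) yet the numeric similarity it induces is an artifact of the arbitrary prime assignment; the bags and the second prime assignment are illustrative choices, not from the paper:

```python
from math import prod

def prime_code(bag, primes):
    # the aggregate of a bag is the product of its values' primes
    return prod(primes[v] for v in bag)

bag_a = ["Fiction"] * 5 + ["Non-Fiction"] * 4   # 5 Fiction, 4 Non-Fiction
bag_b = ["Fiction"] * 9                          # 9 Fiction
bag_c = ["Fiction"] * 5 + ["Non-Fiction"] * 5   # 5 Fiction, 5 Non-Fiction

# Whether bag_a looks numerically closer to bag_c than to bag_b flips when
# the (arbitrary) prime assignment is swapped -- the induced similarity
# carries no meaning for the learner.
orderings = []
for primes in ({"Non-Fiction": 2, "Fiction": 3},
               {"Non-Fiction": 3, "Fiction": 2}):
    a, b, c = (prime_code(x, primes) for x in (bag_a, bag_b, bag_c))
    orderings.append(abs(a - c) < abs(a - b))
print(orderings)  # [True, False]
```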

Principle 2. Aggregates should induce a similarity with respect to the learning task that facilitates discrimination by grouping together target cases of the same class.

² Related issues of quantifying the goodness of transformation operators have been raised by Gärtner et al. (2002) in the context of "good kernels" for structured data.

Thus, we face a tradeoff between instance discriminability and similarity preservation. Coding maintains instance discriminability perfectly, but almost certainly obscures the similarity. COUNT and MODE, on the other hand, lose much instance discriminability, but will assign identical values to bags that are in some sense similar: either to bags of identical size, or to bags that contain mostly the same element. However, whether or not COUNT or MODE are predictive will depend on the modeling task. They do not induce a task-specific similarity, as their values are independent of the particular class labels.

Furthermore, since most similarity-preserving operators involve information loss, it might be advantageous to use multiple operators. A combination of orthogonal features could on the one hand capture more information and on the other increase the probability that one of them is discriminative for the specific modeling task.

Principle 3. Various aggregations should be considered, reflecting different notions of similarity.

For our example, consider the following alternative aggregation. Rather than capturing all information in a single aggregate, construct two attributes, one count for each value 'Fiction' and 'Non-Fiction'. The two counts together maintain the full information. Unfortunately, constructing counts for all possible values is possible only if the number of values is small compared to the number of training examples.³

These design principles suggest particular strategies and tactics for aggregation:

- Directly use target (class) values to derive aggregates that already reflect similarity with respect to the modeling task.

- Use numeric aggregates, since they can better trade off instance discriminability and similarity.

- Use multiple aggregates to capture different notions of similarity.

We present in Section 3.3 a novel aggregation approach based on these principles that is particularly appropriate for high-dimensional categorical variables.

3. Aggregation for relational learning

To provide context for the presentation of new aggregates, and the basis for a comprehensive empirical analysis of aggregation-based attribute construction, we will present briefly a learning system that can be applied to non-trivial relational domains. ACORA (Automated Construction of Relational Attributes) is a system that converts a relational domain into a feature-vector representation, using aggregation to construct attributes automatically. ACORA consists of four nearly independent modules, as shown in Figure 2:

- exploration: constructing bags of related entities using joins and breadth-first search,

- aggregation: transforming bags of objects into single-valued features,

- feature selection, and

- model estimation.

³ Model induction methods suitable for high-dimensional input spaces may confer an advantage for such cases, as they often do for text. However, the transformation is not trivial. Since the input space is structured, it may be beneficial to tag values with the relation (chain) linking them to the target variable: author-smith versus cited-author-smith versus coauthor-smith, etc. Also, for relational problems even producing single-number aggregations can lead to a large number of features if there are moderately many relation chains to consider. Alternatively, such methods may be a useful alternative to this paper's methods for creating aggregation features: for example, include features representing the classification score based on authors alone, based on cited-author, based on coauthor, etc.

Fig. 2  ACORA's transformation process with four transformation steps: exploration, feature construction, feature selection, and model estimation and prediction. The first two (exploration and feature construction) transform the originally relational task (multiple tables with one-to-many relationships) into a corresponding propositional task (feature-vector representation)

Figure 3 outlines the ACORA algorithm in pseudocode. Since the focus of this work is on aggregation, we will concentrate on distribution-based aggregation assuming bags of values. Producing such bags of related objects requires the construction of a domain graph where the nodes represent the tables and the edges capture links between tables through identifiers; this is explained in more detail in Appendix A. Following the aggregation, a feature selection procedure identifies valuable features for the modeling task, and in the final step ACORA estimates a classification model and makes predictions. Feature selection, model estimation, and prediction use conventional approaches, including logistic regression, the decision tree learner C4.5 (Quinlan, 1993), and naive Bayes using WEKA (Witten & Frank, 1999), and are not discussed further in this paper.

The main idea behind the feature construction is to store meta-data on the (class-conditional) distributions of attributes' values, and then to use vector distances to compare the bags of values associated with particular cases to these distributional meta-data. In order to describe this precisely, we first must introduce some formal notation.

3.1. Setup and notation

A relational probability estimation (or classification) task is defined by a set of tables Q, R, ... (denoted by uppercase letters), including a particular target table T, in a multi-relational database RDB. Every table R contains rows r (denoted in lowercase). The rows t of T are the target entities or target cases. Each table R has n_R fields and a row r represents the vector of field values r = (r.f1, ..., r.f_{n_R}) for a particular entity, which we will shorten to r = (r.1, ..., r.n_R). Thus, R.f denotes a field variable in table R, and r.f denotes the value of R.f for entity r.

The domain or type, D(R.j), of field j in table R is either ℝ in the case of numeric attributes, or the set of values that a categorical attribute R.j can assume; in cases where this is not known a priori, we define D(R.j) = ∪_{r ∈ R} r.j, the set of values that are observed in field j across all rows r of table R. The cardinality |D(R.j)| of a categorical attribute is equal to the number of distinct values that the attribute can take.

Fig. 3  Pseudocode of the ACORA algorithm

One particular attribute T.c in the target table T is the class label for which a model is to be learned given all the information in RDB. We will consider binary classification where D(T.c) = {0, 1}. The main distinction between relational and propositional model induction is the additional information in tables of RDB other than T. This additional information can be associated with instances in the target table via keys. The conventional definition of a key requires a categorical attribute R.k to be unique across all rows in table R (the cardinality of the attribute is equal to the number of rows in the table). A link to information in another table Q is established if that key R.k also appears as Q.l in another table Q, where it would be called a foreign key. This definition of a foreign key requires an equality relation E_R between the types of pairs of attributes, E_R(D(R.k), D(Q.l)). We will assume that for the categorical attributes in RDB this equality relation is provided.

More fundamentally, keys are used to express semantic links between the real entities that are modeled in the RDB. In order to capture these links, in addition to entities' attributes we also must record an identifier for each entity. Although database keys often are true identifiers (e.g., social security numbers), all identifiers are not necessarily keys in a particular RDB. This can be caused either by a lack of normalization of the database or by certain information not being stored in the database. For example, consider domains where no information is provided for an entity beyond a "name": short names of people in chatrooms, names of people transcribed from captured telephone conversations, email addresses of contributors in news groups. In such cases RDB may have a table to capture the relations between entities, but not a table for the properties of the entity. This would violate the formal definition of key, since there is no table where the identifier is unique. An example of an identifier that is not a key is the ISBN field in the transaction table in Figure 1.

Without semantic information about the particular domain it is impossible to say whether a particular field reflects the identity of some real entity. A heuristic definition of identifiers can be based on the cardinality of its type (or an identical type under E_R):

Definition 1. R.k is an identifier if D(R.k) ≠ ℝ and ∃ Q.l with cardinality ≥ I_MIN and E_R(D(R.k), D(Q.l)).

Informally, an identifier is a categorical attribute where the cardinality of its type, or of some equivalent type, is larger than some constant I_MIN. Note that for many domains the distinction between keys and identifiers will be irrelevant, because both definitions capture the same set of attributes. If I_MIN is at most the size of the smallest table, the keys will be a subset of the identifiers. The use of identifiers to link objects in a database (still assuming an equality relation between pairs of fields) will therefore provide at least as much information as the use of keys. The choice of I_MIN is bounded from above by s_T, the size of the target table.⁴

A direct relationship between entities is a pair of identifier fields (Q.l, R.k) of equivalent type. For the modeling task we are mostly interested in entities that are related directly or indirectly to the cases in the target table T. Indirect relationships are captured by chains of identifier pairs such that the chain starts from the target table T and the second attribute of a pair is in the same table as the first attribute of the next pair: (T.n, Q.m), (Q.l, R.k), .... The bag B of objects related to a case t in T under a relationship (T.n, R.k) is defined as B_R(t) = {r | t.n = r.k}, and the bag of related values of field R.j is defined as B_R.j(t) = {r.j | t.n = r.k}. For simplicity of notation we present this definition only for direct relationships and do not index the bag by the full details of the underlying relationship, but only by the final table; the extension to indirect relationships is straightforward. The reader should generally keep in mind that B is not defined globally but for a specific relationship chain.
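As an illustration of this bag construction, the following sketch groups the example Transaction rows from Figure 1 by the CID identifier. The tuple layout is an assumption made for illustration, not ACORA's internal representation:

```python
from collections import defaultdict

# Transaction rows from the running example: (CID, TYPE, ISBN, PRICE)
transactions = [
    ("C1", "Fiction", 523, 9.49),
    ("C2", "Non-Fiction", 231, 12.99),
    ("C2", "Non-Fiction", 523, 9.49),
    ("C2", "Fiction", 856, 4.99),
    ("C3", "Non-Fiction", 231, 12.99),
    ("C4", "Fiction", 673, 7.99),
    ("C4", "Fiction", 475, 10.49),
    ("C4", "Fiction", 856, 4.99),
    ("C4", "Non-Fiction", 937, 8.99),
]

# B_R(t): bag of related rows per target case t, linked through CID
bags = defaultdict(list)
for row in transactions:
    bags[row[0]].append(row)

# B_R.j(t): bag of values of one field, e.g. TYPE for customer C2
b_type_c2 = [r[1] for r in bags["C2"]]
print(b_type_c2)  # ['Non-Fiction', 'Non-Fiction', 'Fiction']
```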

3.2. Simple aggregation

In order to apply traditional induction techniques, aggregation operators are needed to incorporate information from one-to-many relationships as in our example in Figure 1, joining on CID. The challenge in this example is the aggregation of the ISBN attribute, which we assume has cardinality larger than I_MIN. An aggregation operator A provides a mapping from bags of values B_R.j(t) to ℝ, to ℕ, or to the original type of the field D(R.j). Simple aggregation operators for bags of categorical attributes are the COUNT, value counts for all possible values v ∈ D(R.j), and the MODE. The COUNT = |B_R.j(t)| captures only the size of the bag. COUNT_v for a particular value v is the number of times value v appeared in the bag B_R.j(t), and the MODE is the value v that appears most often in B_R.j(t). In the example, MODE(B_Transaction.TYPE(C2, 1)) = 'Non-Fiction' for the bag of values from the TYPE field in the Transaction table, related to the case (C2, 1) in the customer table through the CID identifier.
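These simple aggregates can be sketched directly on the example TYPE bag (a minimal illustration, not ACORA code):

```python
from collections import Counter

# The TYPE bag for customer C2 from the running example
bag = ["Non-Fiction", "Non-Fiction", "Fiction"]

count = len(bag)                    # COUNT = |B|
counts = Counter(bag)               # COUNT_v for each value v
mode = counts.most_common(1)[0][0]  # MODE = most frequent value

print(count)                  # 3
print(counts["Non-Fiction"])  # 2
print(mode)                   # Non-Fiction
```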

None of these simple aggregates is appropriate for high-cardinality fields. For example, since most customers buy a book only once, for bags of ISBNs there will be no well-defined MODE. The number of counts (all equal to either zero or one) would equal the cardinality of the identifier's domain, and could exceed the number of training examples by orders of magnitude, leading to overfitting.

⁴ There is no clear lower limit, but very small choices (e.g., below 50) for I_MIN are likely to have a detrimental effect on model estimation, in terms of run time, and potentially also in terms of accuracy, because too many irrelevant relationships will be considered.

More generally, and independently of our definition of identifiers, any categorical attribute with high cardinality poses a problem for aggregation. This has been recognized implicitly in prior work (see Section 5), but rarely addressed explicitly. Some relational learning systems (Krogel & Wrobel, 2001) only consider attributes with cardinality less than n, typically below 50; Woznica et al. (2004) define standard attributes excluding keys; and many ILP systems (Muggleton & De Raedt, 1994) require the explicit identification of the categorical values that may be considered for equality tests, leaving the selection to the user.

3.3. Aggregation using distributional meta-data

Aggregation summarizes a set or a distribution of values. As we will describe in detail, ACORA creates reference summaries and saves them as "meta-data" about the unconditional or class-conditional distributions, against which to compare summaries of the values related to particular cases.

Although its use is not as widespread as in statistical hypothesis testing, distributional meta-data are not foreign to machine learning. Naive Bayes stores class-conditional likelihoods for each attribute. In fraud detection, distributions of normal activity have been stored to produce variables indicating deviations from the norm (Fawcett & Provost, 1997). Aggregates like the mean and the standard deviation of related numeric values also summarize the underlying distribution; under the assumption of normality those two aggregates fully describe the distribution. Even the MODE of a categorical variable is a crude summary of the underlying distribution (i.e., the expected value). In the case of categorical attributes, the distribution can be described by the likelihoods: the counts for each value normalized by the bag size. So all these aggregators attempt to characterize, for each bag, the distribution from which its values were drawn. Ultimately the classification model using such features tries to find differences in the distributions.

Estimating a distribution from each bag of categorical values of a high-cardinality attribute is problematic. The number of parameters (likelihoods) for each distribution is equal to the attribute's cardinality minus one. Unless the bag of related entities is significantly larger than the cardinality, the estimated likelihoods will not be reliable: the number of parameters often will exceed the size of the bag.⁵ We make the simplifying assumption that all objects related to any positive target case were drawn from the same distribution. We therefore only estimate two distributions, rather than one for each target case.

Table 1 presents the result of the join (on CID) of the two tables in our example database (step 7 in the pseudocode). Consider the bag B_Transaction(C2, 1) of related transactions for customer C2:

⟨(C2, Non-Fiction, 231, 12.99), (C2, Non-Fiction, 523, 9.49), (C2, Fiction, 856, 4.99)⟩

The objective of an aggregation operator A is to convert such a bag of related entities into a single value. In step 8 of the pseudocode, this bag of feature vectors is split by attribute into three bags B_TYPE(C2, 1) = ⟨Non-Fiction, Non-Fiction, Fiction⟩, B_ISBN(C2, 1) = ⟨231, 523, 856⟩, and B_PRICE(C2, 1) = ⟨12.99, 9.49, 4.99⟩. Aggregating each bag of attributes separately brings into play an assumption of class-conditional independence between attributes of related entities (Perlich & Provost, 2003). ACORA may apply one or more aggregation operators to each bag. Simple operators that are applicable to bags of numeric attributes such as B_Transaction.PRICE, or B_PRICE for short, include the SUM = Σ_{c ∈ B_PRICE} c and the MEAN = SUM / |B_PRICE|. Consider on the other hand B_ISBN(C2, 1) = ⟨231, 523, 856⟩. ISBN is an example of a bag of values of an attribute with high cardinality, where the MODE is not meaningful because the bag does not contain a "most common" element. The high cardinality also prevents the construction of counts for each value, because counts for each possible ISBN would result in a very sparse feature vector with a length equal to the cardinality of the attribute (often much larger than the number of training examples), which would be unsuitable for model induction.

⁵ The same problem of too few observations can arise for numeric attributes, if the normality assumption is rejected and one tries to estimate arbitrary distributions (e.g., through Gaussian mixture models).

Table 1  Result of the join of the Customer and Transaction tables on CID for the example classification task in Figure 1. For each target entity (C1 to C4) the one-to-many relationship can result in multiple entries (e.g., three for C2 and four for C4), highlighting the necessity of aggregation

CID  CLASS  TYPE         ISBN  PRICE
C1   0      Fiction      523    9.49
C2   1      Non-Fiction  231   12.99
C2   1      Non-Fiction  523    9.49
C2   1      Fiction      856    4.99
C3   1      Non-Fiction  231   12.99
C4   0      Fiction      673    7.99
C4   0      Fiction      475   10.49
C4   0      Fiction      856    4.99
C4   0      Non-Fiction  937    8.99

3.3.1. Reference vectors and distributions

The motivation for the new aggregation operators presented in the sequel is twofold: (1) to deal with bags of high-cardinality categorical attributes for which no satisfactory aggregation operators are available, and (2) to develop aggregation operators that satisfy the principles outlined in Section 2, in order ultimately to improve predictive performance. Note that even if applicable, the simple aggregates do not satisfy all the principles.

The main idea is to collapse the cardinality of the attribute by applying a vector distance to a vector representation both of the bag of related values and of a reference distribution (or reference bag). Reference bags/distributions are constructed as follows. Let us define a case vector CV_R.j(t) as the vector representation of a bag of categorical values B_R.j(t) related to target case t. Specifically, given an ordering N: D(R.j) → ℕ and a particular value v of field R.j, the value of CV_R.j(t) at position N(v) is equal to the number of occurrences of value v in the bag.

CV_R.j(t)[N(v)] = COUNT_v    (1)

For example, CV_Transaction.TYPE(C2, 1) = [2, 1] for B_Transaction.TYPE(C2, 1) = ⟨Non-Fiction, Non-Fiction, Fiction⟩, under the order N(Non-Fiction) = 1, N(Fiction) = 2.
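A minimal sketch of the case-vector construction of Eq. (1), using 0-based positions rather than the 1-based positions of the text:

```python
# Case vector CV_R.j(t): fix an ordering N over the attribute's domain and
# store, at position N(v), the number of occurrences of v in the bag.
def case_vector(bag, ordering):
    cv = [0] * len(ordering)
    for v in bag:
        cv[ordering[v]] += 1
    return cv

# Order N(Non-Fiction) = 0, N(Fiction) = 1
N = {"Non-Fiction": 0, "Fiction": 1}
print(case_vector(["Non-Fiction", "Non-Fiction", "Fiction"], N))  # [2, 1]
```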

Based on the case vectors in the training data, the algorithm constructs two class-conditional reference vectors RV⁰ and RV¹ and an unconditional reference vector RV*:

Table 2  "Case vector" representation of the bags of the TYPE and ISBN attributes for each target case (C1 to C4) after the exploration in Table 1. The vector components denote the counts of how often a value appeared in entities related to the target case

TYPE       Non-Fiction  Fiction
CV(C1,0)   0            1
CV(C2,1)   2            1
CV(C3,1)   1            0
CV(C4,0)   1            3

ISBN       231  475  523  673  856  937
CV(C1,0)   0    0    1    0    0    0
CV(C2,1)   1    0    1    0    1    0
CV(C3,1)   1    0    0    0    0    0
CV(C4,0)   0    1    0    1    1    1

RV⁰_R.j[N(v)] = (1/s_0) · Σ_{t | t.c=0} CV_R.j(t)[N(v)]    (2)

RV¹_R.j[N(v)] = (1/s_1) · Σ_{t | t.c=1} CV_R.j(t)[N(v)]    (3)

RV*_R.j[N(v)] = (1/(s_0 + s_1)) · Σ_t CV_R.j(t)[N(v)]    (4)

where s_0 is the number of negative target cases, s_1 is the number of positive target cases, and [k] denotes the k-th component of the vector. RV¹_R.j[N(v)] is the average number of occurrences of value v related to a positive target case (t.c = 1), and RV⁰_R.j[N(v)] is the average number of occurrences of value v related to a negative target case (t.c = 0). RV*_R.j[N(v)] is the average number of occurrences of the value related to any target case. We also compute distribution vectors DV⁰, DV¹ and DV* that approximate the class-conditional and unconditional distributions from which the data would have been drawn:

DV⁰_R.j[N(v)] = (1 / Σ_{t | t.c=0} b_t) · Σ_{t | t.c=0} CV_R.j(t)[N(v)]    (5)

DV¹_R.j[N(v)] = (1 / Σ_{t | t.c=1} b_t) · Σ_{t | t.c=1} CV_R.j(t)[N(v)]    (6)

DV*_R.j[N(v)] = (1 / Σ_{t ∈ T} b_t) · Σ_{t ∈ T} CV_R.j(t)[N(v)]    (7)

where b_t is the number of values related to target case t (the size of bag B_R.j(t)). For the example, the case vectors for TYPE and ISBN are shown in Table 2 and the reference vectors and distributions in Table 3. Extend the pseudocode of step 8:

Table 3. Reference vectors and reference distributions for the TYPE and ISBN attributes for objects in Table 1: class-conditional positive DV1, class-conditional negative DV0, and unconditional distribution DV*. The reference vectors and reference distributions capture the same information, but with different normalizations: division by the number of target cases or by the number of related entities.

TYPE   Non-fiction   Fiction
RV1    1.5           0.5
RV0    0.5           2.0
RV*    1.0           1.25
DV1    0.75          0.25
DV0    0.20          0.80
DV*    0.44          0.55

ISBN   231    475    523    673    856    937
DV1    0.5    0      0.25   0      0.25   0
DV0    0      0.2    0.2    0.2    0.2    0.2
DV*    0.22   0.11   0.22   0.11   0.22   0.11
RV1    1      0      0.5    0      0.5    0
RV0    0      0.5    0.5    0.5    0.5    0.5
RV*    0.5    0.25   0.5    0.25   0.5    0.25

3.3.2. Distances to reference vectors and distributions

The aggregation in step 11 of ACORA's pseudocode now can take advantage of the reference vectors by applying different vector distances between a case vector and a reference vector. An aggregation was defined as a mapping from a bag of values to a single value. We now define vector-distance aggregates of categorical values of attribute R.j as:

\[ A(B_{R.j}(t)) = DIST(RV, CV_{R.j}(t)) \tag{8} \]

where DIST can be any vector distance and RV ∈ {RV^0_{R.j}, RV^1_{R.j}, RV^*_{R.j}, DV^0_{R.j}, DV^1_{R.j}, DV^*_{R.j}}. ACORA offers a number of distance measures for these aggregations: likelihood, Euclidean, cosine, edit, and Mahalanobis, since capturing different notions of distance is one of the principles from Section 2. In the case of cosine distance the normalization (RV0 vs. DV0) is irrelevant, since cosine normalizes by the vector length.

Consider the result of step 12 of the algorithm on our example (Table 4), where two new attributes are appended to the original feature vector in the target table, using cosine distance to RV1 for the bags of the TYPE and the ISBN attributes. Both features appear highly predictive (of course the predictive power has to be evaluated in terms of out-of-sample performance for test cases that were not used to construct RV0 and RV1).

Observe the properties of these operators in light of the principles derived in Section 2: (1) they are task-specific if RV is one of the class-conditional reference vectors; (2) they compress the information from categorical attributes of high dimensionality into single numeric values,

Table 4. Feature table F after appending the two new cosine distance features, from bags of the TYPE and ISBN variables to the class-conditional positive reference bag. The new features show a strong correlation with the class label.

t.CID   t.CLASS   Cosine(RV1_TYPE, CV_TYPE(t))   Cosine(RV1_ISBN, CV_ISBN(t))
C1      0         0.316                          0.408
C2      1         0.989                          0.942
C3      1         0.948                          0.816
C4      0         0.601                          0.204

and (3) they can capture different notions of similarity if multiple vector distance measures are used. If the class labels change, the features also will, because the estimates of the distributions will differ. If there were indeed two different class-conditional distributions, the case vectors of positive examples would be expected to have smaller distances to the positive than to the negative class-conditional distribution. The new feature (distance to the positive class-conditional distribution) will thereby reflect a strong similarity with respect to the task. This can be observed in Table 4. Only if the two class distributions are indeed identical should the difference in the distances be close to zero.
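The cosine features of Table 4 can be recomputed directly from the case and reference vectors. The sketch below uses plain Python (not ACORA's implementation); the computed values agree with Table 4 up to rounding:

```python
import math

def cosine(u, v):
    """Cosine similarity between two count/probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Class-conditional positive reference vectors (Table 3)
RV1_TYPE = [1.5, 0.5]
RV1_ISBN = [1, 0, 0.5, 0, 0.5, 0]

# Case vectors (Table 2)
type_cvs = {"C1": [0, 1], "C2": [2, 1], "C3": [1, 0], "C4": [1, 3]}
isbn_cvs = {"C1": [0, 0, 1, 0, 0, 0], "C2": [1, 0, 1, 0, 1, 0],
            "C3": [1, 0, 0, 0, 0, 0], "C4": [0, 1, 0, 1, 1, 1]}

# The two appended features of Table 4 (up to rounding)
features = {t: (cosine(RV1_TYPE, type_cvs[t]), cosine(RV1_ISBN, isbn_cvs[t]))
            for t in type_cvs}
```

Note how the positive cases C2 and C3 score close to 1 on both features while the negative cases score lower, which is exactly the task-specific similarity discussed above.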

3.3.3. Simpler but related aggregate features

An alternative solution to deal with bags of values from high-cardinality attributes is to select a smaller subset of values for which the counts are used as new features. This poses the question of a suitable criterion for selection, and the distributional meta-data can be brought to bear. For example, a simple selection criterion is high overall frequency of a value. In addition to the vector-distance features, ACORA constructs counts for the top n values v for which DV*[N(v)] is largest.

However, the principles in Section 2 suggest choosing the most discriminative values based on the target prediction task. Specifically, ACORA uses the class-conditional reference vectors RV0 and RV1 (or the distributions DV0 and DV1) to select those values that show the largest absolute entries in RV1 − RV0. For example, the most discriminative TYPE value in the example is 'Fiction', with a difference of 1.5 in Table 3.
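A minimal sketch of this selection rule on the example's TYPE attribute (the vectors are taken from Table 3; this is an illustration, not ACORA's actual code):

```python
values = ["Non-fiction", "Fiction"]
RV1 = [1.5, 0.5]   # class-conditional positive reference vector (Table 3)
RV0 = [0.5, 2.0]   # class-conditional negative reference vector (Table 3)

# Difference vector RV1 - RV0; large absolute entries mark discriminative values.
diffs = {v: RV1[i] - RV0[i] for i, v in enumerate(values)}

# Rank values by |RV1 - RV0| and keep the top n for count features.
ranked = sorted(values, key=lambda v: abs(diffs[v]), reverse=True)
most_discriminative = ranked[0]   # 'Fiction', with |0.5 - 2.0| = 1.5
```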

For numeric attributes, ACORA provides straightforward aggregates: MIN, MAX, SUM, MEAN, and VARIANCE. It also discretizes numeric attributes (equal-frequency binning) and estimates class-conditional distributions and distances, similar to the procedure for categorical attributes described in Section 3.3.1. This aggregation makes no prior assumptions about the distributions (e.g., normality) and can capture arbitrary numeric densities. We do not assess this capability in this paper.
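Equal-frequency binning itself is standard; the sketch below shows one common variant (the paper does not specify ACORA's exact boundary handling, so the tie-breaking here is an assumption):

```python
def equal_frequency_bins(xs, n_bins):
    """Boundaries that split the sorted values into n_bins bins of near-equal counts."""
    xs = sorted(xs)
    # One boundary per internal cut point; each bin receives ~len(xs)/n_bins values.
    return [xs[len(xs) * k // n_bins] for k in range(1, n_bins)]

def bin_index(x, boundaries):
    """Map a numeric value to its (categorical) bin index."""
    for i, b in enumerate(boundaries):
        if x < b:
            return i
    return len(boundaries)
```

Once binned, a numeric attribute becomes categorical, and the class-conditional distribution estimates and vector distances of Section 3.3.1 apply unchanged.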

3.4. A Bayesian justification: A relational fixed-effect model

We suggested distance-based aggregates to address a particular problem: the aggregation of categorical variables of high cardinality. The empirical results in Section 4 provide support that distribution-based aggregates can indeed condense information from such attributes and improve generalization performance significantly over alternative aggregates, such as counts for the n most common values. Aside from empirical evidence of superior modeling performance, we now show that the distance-based aggregation operators can be derived as components of a "relational fixed-effect model" with a Bayesian foundation.

Statistical estimation contrasts random-effect models with fixed-effect models (DerSimonian & Laird, 1986). In a random-effect model, the model parameters are not assumed to be constant but instead to be drawn randomly from a distribution for different observations. An analogy can be drawn to the difference between our aggregates and the traditional simple aggregates from Section 3.2. Simple aggregates estimate parameters from a distribution for each bag. This is similar to a random-effect model. Our aggregates, on the other hand, can be seen as a relational fixed-effect model: we assume the existence of only two fixed distributions, one for each of the two classes.6 Under this assumption the number of parameters decreases by a factor of n/2, where n is the number of training examples. More specifically, we define a relational fixed-effect model with the assumption that all bags of objects related to positive target cases are sampled from one distribution DV1 and all objects related to negative target cases are drawn from another distribution DV0. Thus it may become possible to compute reliable estimates of the reference distributions DV1 and DV0, even in the case of categorical attributes of high cardinality, by combining all bags related to positive/negative cases to estimate DV1/DV0.

In a relational context, a target object t is described not only by its own attributes. It also has an identifier (CID in our example) that maps into bags of related objects from different background tables. Given a target object t with a feature vector7 and a set of bags of related objects from different relationships (t.1, ..., t.n_t, B_R(t), ..., B_Q(t)), via Bayes' rule one can express the probability of class c as

\[ P(c \mid t) = P(c \mid t.1,\ldots,t.n_t, B_R(t),\ldots,B_Q(t)) \tag{9} \]
\[ = P(t.1,\ldots,t.n_t, B_R(t),\ldots,B_Q(t) \mid c) \cdot P(c)/P(t). \tag{10} \]

Making the familiar assumption of class-conditional independence of the attributes t.1, ..., t.n_t and of all bags B of related objects allows rewriting the above expression as

\[ P(c \mid t) = \prod_i P(t.i \mid c) \cdot \prod_{B} P(B(t) \mid c) \cdot P(c)/P(t). \tag{11} \]

Assuming that the elements r of each particular bag of related objects B_R(t) are drawn independently, we can rewrite the probability of observing that bag as

\[ P(B_R \mid c) = \prod_{r \in B_R(t)} P(r \mid c). \tag{12} \]

Assuming again class-conditional independence of all attributes r.1, ..., r.n_p of all related entities r, we can finally estimate the class-conditional probability of a bag of values from the training data as

\[ P(B_R(t) \mid c) = \prod_{r \in B_R(t)} \prod_j P(r.j \mid c). \tag{13} \]

6 More than two distributions would be used for multiclass problems, or could be generated via domain knowledge or clustering.

7 Excluding the class label and the identifier.

Switching the order of the products, this term can be rewritten as a product over all attributes over all samples:

\[ P(B_R(t) \mid c) = \prod_j \prod_{r.j \in B_{R.j}(t)} P(r.j \mid c). \tag{14} \]

The second part of this product has a clear probabilistic interpretation: it is the non-normalized (ignoring P(c) and P(t)) probability of observing the bag of values B_{R.j}(t) of attribute R.j given the class c. This non-normalized conditional probability, or likelihood, can be seen as a particular choice of vector distance8 and can be estimated in the previous notation as:

\[ LH(DV^c, CV) = P(B_{R.j}(t) \mid c) = \prod_{r.j \in B_{R.j}(t)} P(r.j \mid c) = \prod_i DV^c[i]^{CV[i]} \tag{15} \]

where i ranges over the set of possible values for the bagged attribute.

Thus, for the particular choice of likelihood (LH) as the distance function, ACORA's aggregation approach can be given a theoretical foundation within a general relational Bayesian framework similar to that of Flach and Lachiche (2004).
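A small sketch of Eq. (15) on the example data (DV1 for ISBN from Table 3, case vectors from Table 2). Note that a practical implementation would presumably smooth the zero probabilities, which otherwise zero out the product for any bag containing an unseen value:

```python
# Class-conditional positive distribution DV1 for ISBN (Table 3)
DV1_ISBN = [0.5, 0.0, 0.25, 0.0, 0.25, 0.0]

# ISBN case vectors for two target cases (Table 2)
CV_C3 = [1, 0, 0, 0, 0, 0]
CV_C2 = [1, 0, 1, 0, 1, 0]

def likelihood(dv, cv):
    """Eq. (15): product over values of P(value|c) raised to its count in the bag.
    In Python 0.0**0 == 1.0, so values absent from the bag contribute a neutral factor."""
    lh = 1.0
    for p, n in zip(dv, cv):
        lh *= p ** n
    return lh

likelihood(DV1_ISBN, CV_C3)   # 0.5: C3's single book (ISBN 231) has P=0.5 under DV1
likelihood(DV1_ISBN, CV_C2)   # 0.5 * 0.25 * 0.25 = 0.03125
```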

This derivation not only provides one theoretical justification for our more general framework of using (multiple) vector distances in combination with class-conditional distribution estimates. It also highlights the three inherent assumptions of the approach: (1) class-conditional independence between attributes (and identifiers) of the target cases, (2) class-conditional independence between related entities, and (3) class-conditional independence between the attributes of related objects. Strong violations are likely to decrease the predictive performance. It is straightforward to extend the expressiveness of ACORA to weaken the first assumption, by (for example) combining pairs of feature values prior to aggregation. The second assumption, of random draws, is more fundamental to aggregation in general and less easily addressed. Relaxing this assumption will typically come at a price: modeling will become increasingly prone to overfitting because the search space expands rapidly. This calls for strong constraints on the search space, as typically are provided for ILP systems in the declarative language bias. We discussed this tradeoff previously (Perlich & Provost, 2003) in the context of noisy domains.

3.5. Implications for learning from identifier attributes

We show in our empirical results in Section 4 the importance of including aggregates of identifiers. The following discussion analyzes the special properties of identifiers and why aggregates of identifiers, in particular additive distances like cosine, can achieve such performance improvements.

Identifiers are categorical attributes with high cardinality. In our example problem we have two such attributes: CID, the identifier of customers, and ISBN, the identifier of books. The task of classifying customers based on the target table T clearly calls for the removal of the unique CID attribute prior to model induction, because it cannot generalize. However, the identifiers of related objects may be predictive out-of-sample. For example, buying Jacques Pépin's latest book may increase the estimated likelihood that the customer would join a cookbook club; in a different domain, calling from a particular cell site (location) may greatly increase the likelihood of fraud. Such identifiers may be shared across multiple target cases that are related to the same objects (e.g., customers who bought the same book). The corresponding increase in the effective number of appearances of the related-object identifier attribute R.j, such as ISBN, allows the estimation of class-conditional probabilities P(r.j|c).

8 Actually, a similarity.

Beyond the immediate relevance of particular identities (e.g., Pépin's book), identifier attributes have a special property: they represent implicitly all characteristics of an object. Indeed, the identity of a related object (a particular cell site) can be more important than any set of available attributes describing that object. This has important implications for modeling: using identifier attributes can overcome the limitations of class-conditional independence in Eq. (12) and even permits learning from unobserved characteristics.

An object identifier R.j like ISBN stands for all characteristics of an object. If observed, these characteristics would appear in another table S as attributes (S.1, ..., S.n_s). Technically, there exists a functional mapping9 F that maps the identifier to a set of values: F(r.j) → (s.1, ..., s.n_s). We can express the joint class-conditional probability (without the independence assumption) of a particular object feature vector without the identifier field as the sum of the class-conditional probabilities of all objects (represented by their identifiers r.j) with the same feature vector:

\[ P(s.1,\ldots,s.n_s \mid c) = \sum_{\{r.j \mid F(r.j) = (s.1,\ldots,s.n_s)\}} P(r.j \mid c). \tag{16} \]

If F is an isomorphism (i.e., no two objects have the same feature vector) the sum vanishes and P(s.1, ..., s.n_s|c) = P(r.j|c). Estimating P(r.j|c) therefore provides information about the joint probability of all its attributes (s.1, ..., s.n_s).
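A toy sketch of this marginalization. The mapping F and the probabilities below are invented for illustration; as footnote 9 notes, F normally is unknown, and the point is only that the identifier-level estimates carry the attribute-level information:

```python
# Hypothetical estimated class-conditional identifier probabilities P(r.j|c)
P_id_given_c = {"id1": 0.5, "id2": 0.3, "id3": 0.2}

# Hypothetical (normally unobserved) mapping F from identifier to attribute value s.u
F = {"id1": "A", "id2": "B", "id3": "A"}

def p_value_given_c(value):
    """Eq. (16)/(17): sum P(identifier|c) over identifiers mapping to `value`."""
    return sum(p for rid, p in P_id_given_c.items() if F[rid] == value)
```

Here P(s.u = "A" | c) = 0.5 + 0.2 = 0.7 without ever observing s.u directly.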

A similar argument can be made for an unobserved attribute s.u (e.g., association with an organization engaging in fraud). In fact, it may be the case that no attribute of the object s was observed and no table S was recorded, as is the case for ISBN in our example. There is nevertheless the dependency F'(r.j) → s.u, for some function F', and the relevant class-conditional probability is equal to the sum over all identifiers with the same (unobserved) value:

\[ P(s.u \mid c) = \sum_{\{r.j \mid F'(r.j) = s.u\}} P(r.j \mid c). \tag{17} \]

Given that s.u is not observable, it is impossible to decide which elements belong in the sum. If, however, s.u is a perfect predictor (i.e., every value of s.u appears only for objects related to target cases of one class c), the class-conditional probability P(r.j|c) will be non-zero for only one class c. In that case the restricted sum in Eq. (17) is equal to the total sum over the class-conditional probabilities of all identifier values:

9 This function F does not need to be known; it is sufficient that it exists.

\[ \sum_{\{r.j \mid F'(r.j) = s.u\}} P(r.j \mid c) = \sum_{r.j} P(r.j \mid c). \tag{18} \]

Note that the total sum over the class-conditional probabilities of all related identifier values now equals the cosine distance between DV^c and a special case vector CV_{s.u} that corresponds to a bag containing all identifiers with value s.u, prior to normalization10 by vector length, because DV^c[N(r.j)] is an estimate of P(r.j|c) and CV[N(r.j)] is typically 1 or 0 for identifier attributes such as ISBN. The cosine distance for a particular bag CV(t) is a biased11 estimate of P(s.u|c), since the bag will typically consist of only a subset of all identifiers with value s.u:

\[ cosine(DV^c_{R.j}, CV) = \frac{1}{\|CV\|} \sum_i DV^c_{R.j}[i] \cdot CV_{R.j}[i]. \tag{19} \]

So far we have assumed a perfect predictor attribute S.u. The overlap between the two class-conditional distributions DV0 and DV1 of the identifier is a measure of the predictive power of S.u, and also of how strongly the total sum in the cosine distance deviates from the correct restricted sum in Eq. (17). The relationship between the class-conditional probability of an unobserved attribute and the cosine distance on the identifier may be the reason why the cosine distance performs better than likelihood in the experiments in Section 4.

Although this view is promising, issues remain. It may be hard to estimate P(r.j|c) due to the lack of sufficient data (it is also much harder to estimate the joint rather than a set of independent distributions). We often do not need to estimate the entire joint distribution, because the true concept is an unknown class-conditional dependence between only a few attributes. Finally, the degree of overlap between the two class-conditional distributions DV0 and DV1 determines how effectively we can learn from unobserved attributes. Nevertheless, the ability to account for identifiers through aggregation can extend the expressive power significantly, as shown empirically in Section 4.

Identifiers have other interesting properties. Identities may often be the cause of relational autocorrelation (Jensen & Neville, 2002). Because a customer bought the first part of the trilogy, he now wants to read how the story continues. Given such a concept, we expect to see autocorrelation between customers who are linked through books. In addition to the identifier proxying for all object characteristics of immediately related entities (e.g., the authors of a book), it also contains the implicit information about all other objects linked to it (e.g., all the other books written by the same author). An identifier therefore may introduce a "natural" Markov barrier that reduces or eliminates the need to extend the search for related entities further than to the direct neighbors.12 We present some evidence of this phenomenon in Section 4.3.3.

10 The effect of normalization can be neglected, since the length of DV^c is 1 and the length of CV is the same for both the class-conditional positive and class-conditional negative cosine distances.

11 We underestimate P(s.u|c) as a function of the size of the bag. The smaller the bag, the more elements of the sum are 0 and the larger the bias.

12 In cases with strong class-label autocorrelation, such a barrier often can be provided by class labels of related instances (Jensen et al., 2004; Macskassy & Provost, 2004).

Table 5. Summary of the properties of the eight domains, including the tables, their sizes, their attributes, types, and the training and test sizes used in the main experiments. C(y) is the cardinality of a categorical attribute and D(y) = R identifies numeric attributes.

Domain  Table: Size   Attribute type description                                Size
XOR     T: 10000      C(tid)=10000, C(c)=2                                      Train: 8000
        O: 55000      C(oid)=1000, C(tid)=10000                                 Test: 2000
AND     T: 10000      C(tid)=10000, C(c)=2                                      Train: 8000
        O: 55000      C(oid)=1000, C(tid)=10000                                 Test: 2000
Fraud   T: 100000     C(tid)=100000                                             Train: 50000
        R: 1551000    C(tid)=100000, C(tid)=100000                              Test: 50000
KDD     T: 59600      C(tid)=59600, C(c)=2                                      Train: 8000
        TR: 146800    C(oid)=490, C(tid)=59600                                  Test: 2000
IPO     T: 2790       C(tid)=2790, C(e)=6, C(sic)=415, C(c)=2, D(d,s,p,r)=R     Train: 2000
        H: 3650       C(tid)=2790, C(bid)=490                                   Test: 800
        U: 2700       C(tid)=2790, C(bid)=490
COOC    T: 1860       C(tid)=1860, C(c)=2                                       Train: 1000
        R: 50600      C(tid)=1860, C(tid)=1860                                  Test: 800
CORA    T: 4200       C(tid)=4200, C(c)=2                                       Train: 3000
        A: 9300       C(tid)=4200, C(aid)=4000                                  Test: 1000
        R: 91000      C(tid)=4200, C(tid)=35000
EBook   T: 19000      C(tid)=19000, C(c,b,m,k)=2, D(a,y,e)=R                    Train: 8000
        TR: 54500     C(oid)=22800, C(tid)=19000, D(p)=R, C(c)=5                Test: 2000

4. Empirical results

After describing the experimental setup, Section 4.3 presents empirical evidence in support of our main claims regarding the generalization performance of the new aggregates. Then we present a sensitivity analysis of the factors influencing the results (Section 4.4).

4.1. Domains

Our experiments are based on eight relational domains that are described in more detail in Appendix B. They typically are transaction or networked-entity domains with predominantly categorical attributes of high cardinality. The first two domains (XOR and AND) are artificial, and were designed to illustrate simple cases where the concepts are based on (combinations of) unobserved attributes. Variations of these domains are also used for the sensitivity analysis later. Fraud is also a synthetic domain, designed to represent a real-world problem (telecommunications fraud detection), where target-object identifiers (particular telephone numbers) have been used in practice for classification (Fawcett & Provost, 1997; Cortes et al., 2002). The remaining domains include data from real-world domains that satisfy the criteria of having interconnected entities. An overview of the number of tables, the number of objects, and the attribute types is given in Table 5. The equality relation of the types is implied by identical attribute names.

4.2. Methodology

Our main objective is to demonstrate that distribution-based vector distances for aggregation generalize when simple aggregates like MODE or COUNT_v for all values are inapplicable or inadequate. In order to provide a solid baseline we extend these simple aggregates for use in the presence of attributes with high cardinality: ACORA constructs COUNT_v for the 10 most common values (an extended MODE based on the meta-data) and counts for all values if the number of distinct values is at most 50, as suggested by Krogel and Wrobel (2003). ACORA generally includes an attribute for each bag representing the bag size, as well as all original attributes from the target table.

Feature construction. Table 6 summarizes the different aggregation methods. ACORA uses 50% of the training set for the estimation of class-conditional reference vectors and the other 50% for model estimation. The model estimation cannot be done on the same data set that was used for construction, since the use of the target during construction would lead to overestimation of the predictive performance. We also include distances from bags to the unconditional distribution (estimates calculated on the full training set). Unless otherwise noted, for the experiments the stopping criterion for the exploration is depth = 1, meaning for these domains that each background table is used once. The cutoff for identifier attributes I_MIN was set to 400.

Model estimation. We use WEKA's logistic regression (Witten & Frank, 1999) to estimate probabilities of class membership from all features. Using decision trees (including the differences of distances as suggested in Section 4.3.1) did not change the relative performance between different aggregation methods significantly, but generally performed worse than logistic regression. We did not use feature selection for the presented results; feature selection did not change the relative performance, since for these runs the number of constructed features remains relatively small.

Evaluation. The generalization performance is evaluated in terms of the AUC: area under the ROC curve (Bradley, 1997). All results represent out-of-sample generalization performance on test sets, averaged over 10 runs. The objects in the target table for each run are split randomly into a training set and a test set (cf. Table 5). We show error bars of ± one standard deviation in the figures and include the standard deviation in the tables in parentheses.
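For reference, the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counting one half). A minimal rank-based sketch (not the evaluation code used in the paper):

```python
def auc(scores, labels):
    """Area under the ROC curve, computed from its rank interpretation:
    the fraction of positive/negative pairs where the positive scores higher,
    with ties counted as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random scoring and 1.0 to perfect ranking, which is why it is a natural threshold-free summary for the class-probability estimates produced by logistic regression.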

4.3. Main results

We now analyze the relative generalization performance of different aggregation operators. Our main claim, that class-conditional, distribution-based aggregates add generalization power to classification with high-dimensional categorical variables, was motivated by three arguments that are considered in the sequel:

- Target-dependent aggregates, such as vector distances to class-conditional reference vectors, exhibit task-specific similarity;
- This task-specific similarity, in combination with the instance discriminability conferred by using numeric distances, improves generalization performance;
- Aggregating based on vector distances allows learning from identifier attributes, which hold certain special properties (viz., proxying for: unseen features, interdependent features, and information farther away in the network). Coalescing information from many identifiers can improve over including only particular identifiers.

Table 6. Summary of aggregation operators used in the experiments, grouped by type: counts for particular categorical values, different vector distances, and combinations of vector distances to conditional or unconditional reference distributions, where t denotes a target case.

Method       Description
COUNTS       ACORA constructs count features for all possible categorical values if the number of values is less than 50. In particular this excludes all key attributes.
MCC          Counts for the 10 most common categorical values (values with the largest entries in the unconditional reference bag B*). MCC can be applied to all categorical attributes, including identifiers.
MDC          Counts for the 10 most discriminative categorical values (Section 3.3.3), defined as the values with the largest absolute difference in the vector B1 − B0. MDC can be applied to all categorical attributes, including identifiers.
Cosine       Cosine(DV1, CV(t)), Cosine(DV0, CV(t))
Mahalanobis  Mahalanobis(RV1, CV(t)), Mahalanobis(RV0, CV(t))
Euclidean    Euclidean(RV1, CV(t)), Euclidean(RV0, CV(t))
Likelihood   Likelihood(DV1, CV(t)), Likelihood(DV0, CV(t))
UCVD         All unconditional vector distances: Cosine(DV*, CV(t)), Mahalanobis(DV*, CV(t)), Euclidean(DV*, CV(t)), Likelihood(DV*, CV(t))
CCVD         All class-conditional vector distances: Cosine(DV1, CV(t)), Cosine(DV0, CV(t)), Euclidean(DV1, CV(t)), Euclidean(DV0, CV(t)), Mahalanobis(DV1, CV(t)), Mahalanobis(DV0, CV(t)), Likelihood(DV1, CV(t)), Likelihood(DV0, CV(t))
DCCVD        All differences of class-conditional vector distances: Cosine(DV1, CV(t)) − Cosine(DV0, CV(t)), Mahalanobis(DV1, CV(t)) − Mahalanobis(DV0, CV(t)), Euclidean(DV1, CV(t)) − Euclidean(DV0, CV(t)), Likelihood(DV1, CV(t)) − Likelihood(DV0, CV(t))

We also argued that using multiple aggregates can improve generalization performance. As we will see, this point is not supported as strongly by the experimental results.

4.3.1. Task-specific similarity

We argued in Section2that task-speci?c aggregates have the potential to identify discrim-inative information because they exhibit task-speci?c similarity(making positive instances of related bags similar to each other).For the XOR problem,Figure4shows on the left the two-dimensional instance space de?ned by using as attributes two class-conditional aggre-gations of identi?ers of related entities:the cosine distance to the positive distribution and the cosine distance to the negative distribution.Although the positive target objects each has a different bag of identi?ers,when using the constructed attributes the positive objects are similar to each other(left-upper half)and the negative are similar to each other(right-lower half).

Importantly, it also is clear from the figure that although positive target cases have on average a larger cosine distance to the positive class-conditional distribution (they are mostly

With的用法全解

With的用法全解 with结构是许多英语复合结构中最常用的一种。学好它对学好复合宾语结构、不定式复合结构、动名词复合结构和独立主格结构均能起很重要的作用。本文就此的构成、特点及用法等作一较全面阐述,以帮助同学们掌握这一重要的语法知识。 一、 with结构的构成 它是由介词with或without+复合结构构成,复合结构作介词with或without的复合宾语,复合宾语中第一部分宾语由名词或代词充当,第二部分补足语由形容词、副词、介词短语、动词不定式或分词充当,分词可以是现在分词,也可以是过去分词。With结构构成方式如下: 1. with或without-名词/代词+形容词; 2. with或without-名词/代词+副词; 3. with或without-名词/代词+介词短语; 4. with或without-名词/代词 +动词不定式; 5. with或without-名词/代词 +分词。 下面分别举例: 1、 She came into the room,with her nose red because of cold.(with+名词+形容词,作伴随状语)

2、 With the meal over , we all went home.(with+名词+副词,作时间状语) 3、The master was walking up and down with the ruler under his arm。(with+名词+介词短语,作伴随状语。) The teacher entered the classroom with a book in his hand. 4、He lay in the dark empty house,with not a man ,woman or child to say he was kind to me.(with+名词+不定式,作伴随状语)He could not finish it without me to help him.(without+代词 +不定式,作条件状语) 5、She fell asleep with the light burning.(with+名词+现在分词,作伴随状语) Without anything left in the with结构是许多英 语复合结构中最常用的一种。学好它对学好复合宾语结构、不定式复合结构、动名词复合结构和独立主格结构均能起很重要的作用。本文就此的构成、特点及用法等作一较全面阐述,以帮助同学们掌握这一重要的语法知识。 二、with结构的用法 with是介词,其意义颇多,一时难掌握。为帮助大家理清头绪,以教材中的句子为例,进行分类,并配以简单的解释。在句子中with结构多数充当状语,表示行为方式,伴随情况、时间、原因或条件(详见上述例句)。 1.带着,牵着…… (表动作特征)。如: Run with the kite like this.

with复合结构专项练习96126

with复合结构专项练习(二) 一请选择最佳答案 1)With nothing_______to burn,the fire became weak and finally died out. A.leaving B.left C.leave D.to leave 2)The girl sat there quite silent and still with her eyes_______on the wall. A.fixing B.fixed C.to be fixing D.to be fixed 3)I live in the house with its door_________to the south.(这里with结构作定语) A.facing B.faces C.faced D.being faced 4)They pretended to be working hard all night with their lights____. A.burn B.burnt C.burning D.to burn 二:用with复合结构完成下列句子 1)_____________(有很多工作要做),I couldn't go to see the doctor. 2)She sat__________(低着头)。 3)The day was bright_____.(微风吹拂) 4)_________________________,(心存梦想)he went to Hollywood. 三把下列句子中的划线部分改写成with复合结构。 1)Because our lessons were over,we went to play football. _____________________________. 2)The children came running towards us and held some flowers in their hands. _____________________________. 3)My mother is ill,so I won't be able to go on holiday. _____________________________. 4)An exam will be held tomorrow,so I couldn't go to the cinema tonight. _____________________________.

with的用法大全

with的用法大全----四级专项训练with结构是许多英语复合结构中最常用的一种。学好它对学好复合宾语结构、不定式复合结构、动名词复合结构和独立主格结构均能起很重要的作用。本文就此的构成、特点及用法等作一较全面阐述,以帮助同学们掌握这一重要的语法知识。 一、 with结构的构成 它是由介词with或without+复合结构构成,复合结构作介词with或without的复合宾语,复合宾语中第一部分宾语由名词或代词充当,第二部分补足语由形容词、副词、介词短语、动词不定式或分词充当,分词可以是现在分词,也可以是过去分词。With结构构成方式如下: 1. with或without-名词/代词+形容词; 2. with或without-名词/代词+副词; 3. with或without-名词/代词+介词短语; 4. with或without-名词/代词+动词不定式; 5. with或without-名词/代词+分词。 下面分别举例:

1、 She came into the room,with her nose red because of cold.(with+名词+形容词,作伴随状语) 2、 With the meal over , we all went home.(with+名词+副词,作时间状语) 3、The master was walking up and down with the ruler under his arm。(with+名词+介词短语,作伴随状语。) The teacher entered the classroom with a book in his hand. 4、He lay in the dark empty house,with not a man ,woman or child to say he was kind to me.(with+名词+不定式,作伴随状语) He could not finish it without me to help him.(without+代词 +不定式,作条件状语) 5、She fell asleep with the light burning.(with+名词+现在分词,作伴随状语) 6、Without anything left in the cupboard, she went out to get something to eat.(without+代词+过去分词,作为原因状语) 二、with结构的用法 在句子中with结构多数充当状语,表示行为方式,伴随情况、时间、原因或条件(详见上述例句)。

with用法归纳

with用法归纳 (1)“用……”表示使用工具,手段等。例如: ①We can walk with our legs and feet. 我们用腿脚行走。 ②He writes with a pencil. 他用铅笔写。 (2)“和……在一起”,表示伴随。例如: ①Can you go to a movie with me? 你能和我一起去看电影'>电影吗? ②He often goes to the library with Jenny. 他常和詹妮一起去图书馆。 (3)“与……”。例如: I’d like to have a talk with you. 我很想和你说句话。 (4)“关于,对于”,表示一种关系或适应范围。例如: What’s wrong with your watch? 你的手表怎么了? (5)“带有,具有”。例如: ①He’s a tall kid with short hair. 他是个长着一头短发的高个子小孩。 ②They have no money with them. 他们没带钱。 (6)“在……方面”。例如: Kate helps me with my English. 凯特帮我学英语。 (7)“随着,与……同时”。例如: With these words, he left the room. 说完这些话,他离开了房间。 [解题过程] with结构也称为with复合结构。是由with+复合宾语组成。常在句中做状语,表示谓语动作发生的伴随情况、时间、原因、方式等。其构成有下列几种情形: 1.with+名词(或代词)+现在分词 此时,现在分词和前面的名词或代词是逻辑上的主谓关系。 例如:1)With prices going up so fast, we can't afford luxuries. 由于物价上涨很快,我们买不起高档商品。(原因状语) 2)With the crowds cheering, they drove to the palace. 在人群的欢呼声中,他们驱车来到皇宫。(伴随情况) 2.with+名词(或代词)+过去分词 此时,过去分词和前面的名词或代词是逻辑上的动宾关系。

with用法小结

with用法小结 一、with表拥有某物 Mary married a man with a lot of money . 马莉嫁给了一个有着很多钱的男人。 I often dream of a big house with a nice garden . 我经常梦想有一个带花园的大房子。 The old man lived with a little dog on the lonely island . 这个老人和一条小狗住在荒岛上。 二、with表用某种工具或手段 I cut the apple with a sharp knife . 我用一把锋利的刀削平果。 Tom drew the picture with a pencil . 汤母用铅笔画画。 三、with表人与人之间的协同关系 make friends with sb talk with sb quarrel with sb struggle with sb fight with sb play with sb work with sb cooperate with sb I have been friends with Tom for ten years since we worked with each other, and I have never quarreled with him . 自从我们一起工作以来,我和汤姆已经是十年的朋友了,我们从没有吵过架。 四、with 表原因或理由 John was in bed with high fever . 约翰因发烧卧床。 He jumped up with joy . 他因高兴跳起来。 Father is often excited with wine . 父亲常因白酒变的兴奋。 五、with 表“带来”,或“带有,具有”,在…身上,在…身边之意

with复合宾语的用法(20201118215048)

with+复合宾语的用法 一、with的复合结构的构成 二、所谓"with的复合结构”即是"with+复合宾语”也即"with +宾语+宾语补足语” 的结构。其中的宾语一般由名词充当(有时也可由代词充当);而宾语补足语则是根据 具体的需要由形容词,副词、介词短语,分词短语(包括现在分词和过去分词)及不定式短语充当。下面结合例句就这一结构加以具体的说明。 三、1、with +宾语+形容词作宾补 四、①He slept well with all the windows open.(82 年高考题) 上面句子中形容词open作with的宾词all the windows的补足语, ②It' s impolite to talk with your mouth full of food. 形容词短语full of food 作宾补。Don't sleep with the window ope n in win ter 2、with+宾语+副词作宾补 with Joh n away, we have got more room. He was lying in bed with all his clothes on. ③Her baby is used to sleeping with the light on.句中的on 是副词,作宾语the light 的补足语。 ④The boy can t play with his father in.句中的副词in 作宾补。 3、with+宾语+介词短语。 we sat on the grass with our backs to the wall. his wife came dow n the stairs,with her baby in her arms. They stood with their arms round each other. With tears of joy in her eyes ,she saw her daughter married. ⑤She saw a brook with red flowers and green grass on both sides. 句中介词短语on both sides 作宾语red flowersandgreen grass 的宾补, ⑥There were rows of white houses with trees in front of them.,介词短语in front of them 作宾补。 4、with+宾词+分词(短语 这一结构中作宾补用的分词有两种,一是现在分词,二是过去分词,一般来说,当分词所表 示的动作跟其前面的宾语之间存在主动关系则用现在分词,若是被动关系,则用过去分词。 ⑦In parts of Asia you must not sit with your feet pointing at another person.(高一第十课),句中用现在分词pointing at…作宾语your feet的补足语,是因它们之间存在主动关系,或者说point 这一动作是your feet发出的。 All the after noon he worked with the door locked. She sat with her head bent. She did not an swer, with her eyes still fixed on the wall. The day was bright,with a fresh breeze(微风)blowing. I won't be able to go on holiday with my mother being ill. With win ter coming on ,it is time to buy warm clothes. He soon fell asleep with the light still bur ning. ⑧From space the earth looks like ahuge water covered globe,with a few patches of land stuk ing out above the water而在下面句子中因with的宾语跟其宾补之间存在被动关系,故用过去分词作宾补:

The "with" complex structure: usage and exercises (complete version)

The "with" complex structure

I. Common forms of the "with" complex structure
1. "with + noun/pronoun + prepositional phrase"
The man was walking on the street, with a book under his arm.
2. "with + noun/pronoun + adjective"
With the weather so close and stuffy, ten to one it'll rain presently.
3. "with + noun/pronoun + adverb"
The square looks more beautiful than ever with all the lights on.
4. "with + noun/pronoun + noun"
He left home, with his wife a hopeless soul.
5. "with + noun/pronoun + done". Here the past participle and the object stand in a passive relation, indicating that the action has been completed.
With this problem solved, neomycin 1 is now in regular production.
6. "with + noun/pronoun + -ing participle". This structure emphasizes that the noun is the agent of the -ing action, or that an action or state is in progress.
He felt more uneasy with the whole class staring at him.
7. "with + object + to do". Here the infinitive and the object stand in a passive relation, indicating an action that has not yet taken place.
So in the afternoon, with nothing to do, I went on a round of the bookshops.

II. Syntactic functions of the "with" complex structure
1. The "with" complex structure expresses a state or background circumstance, and usually serves as an adverbial of accompaniment, manner, cause, condition, etc.
With machinery to do all the work, they will soon have got in the crops. (adverbial of cause)
The boy always sleeps with his head on the arm. (adverbial of accompaniment)
The soldier had him stand with his back to his father. (adverbial of manner)
With spring coming on, trees turn green. (adverbial of time)
2. The "with" complex structure can also serve as an attributive.
Anyone with eyes in his head can see it's exactly like a rope.

[College entrance exam questions]
1. ___ two exams to worry about, I have to work really hard this weekend. (2004 Beijing)
A. With  B. Besides  C. As for  D. Because of
Answer: A. "with + object + infinitive" serves as an adverbial expressing cause.
2. It was a pity that the great writer died, ______ his works unfinished. (2004 Fujian)
A. for  B. with  C. from  D. of
Answer: B. "with + object + past participle" serves as an adverbial expressing state.
3. _____ production up by 60%, the company has had another excellent year. (NMET)
A. As  B. For  C. With  D. Through
Answer: C. "with + object + adverb" serves as an adverbial in the sentence.

A summary of the "with" complex structure

A summary of the "with" complex structure

The with-structure is the most commonly used of the many English complex structures. Mastering it is of great help in learning the complex-object structure, the infinitive complex structure, the gerund complex structure, and the absolute (independent genitive) construction. This article gives a fairly comprehensive account of its formation, features, and usage to help students grasp this important point of grammar.

I. Formation of the with-structure
It consists of the preposition with or without followed by a complex object. The first part of the complex object is a noun or pronoun; the second part, the complement, is an adjective, an adverb, a prepositional phrase, an infinitive, or a participle (present or past). The possible patterns are:
1. with/without + noun/pronoun + adjective;
2. with/without + noun/pronoun + adverb;
3. with/without + noun/pronoun + prepositional phrase;
4. with/without + noun/pronoun + infinitive;
5. with/without + noun/pronoun + participle.
Examples of each:
1. She came into the room, with her nose red because of cold. (with + noun + adjective, adverbial of accompaniment)
2. With the meal over, we all went home. (with + noun + adverb, adverbial of time)
3. The master was walking up and down with the ruler under his arm. (with + noun + prepositional phrase, adverbial of accompaniment) The teacher entered the classroom with a book in his hand.
4. He lay in the dark empty house, with not a man, woman or child to say he was kind to me. (with + noun + infinitive, adverbial of accompaniment) He could not finish it without me to help him. (without + pronoun + infinitive, adverbial of condition)
5. She fell asleep with the light burning. (with + noun + present participle, adverbial of accompaniment) Without anything left in the cupboard, she went out to get something to eat. (without + pronoun + past participle, adverbial of cause)

II. Usage of the with-structure
In a sentence the with-structure mostly serves as an adverbial, expressing manner, accompanying circumstance, time, cause, or condition (see the examples above). It can also serve as an attributive. For example:
1. I like eating the mooncakes with eggs.
2. From space the earth looks like a huge water-covered globe with a few patches of land sticking out above the water.
3. A little boy with two of his front teeth missing ran into the house.

III. Features of the with-structure
1. The with-structure consists of with/without plus a complex object. Grammatically, the two parts of the complex object are object and object complement; logically, however, they stand in a subject-predicate relation — that is, the first part can serve as subject and the second as predicate to form a sentence. For example: With him taken good care of, we felt quite relieved. → (He was taken good care of.) She fell asleep with the light burning. → (The light was burning.) With her hair gone, there could be no use for them. → (Her hair was gone.)
2. When the first part of the with-structure is a personal pronoun, the objective case must be used. For example: He could not finish it without me to help him.

IV. Additional notes
1. Position of the with-structure in the sentence: when it serves as an adverbial of time, condition, or cause, it is generally placed at …

A comprehensive guide to the preposition "with"

A comprehensive guide to the preposition "with"

"With" is a preposition whose basic meaning is "using", but it also helps form a remarkably versatile sentence pattern, in which it plays two roles: adverbial and adjectival.

The "with" structure functions adverbially in the following patterns:
1. "with + object + present participle or participial phrase", e.g.:
(1) This article deals with common social ills, with particular attention being paid to vandalism.
2. "with + object + past participle or participial phrase", e.g.:
(2) With different techniques used, different results can be obtained.
(3) The TV mechanic entered the factory with tools carried in both hands.
3. "with + object + adjective or adjective phrase", e.g.:
(4) With so much water vapour present in the room, some iron-made utensils have become rusty easily.
(5) Every night, Helen sleeps with all the windows open.
4. "with + object + prepositional phrase", e.g.:
(6) With the school badge on his shirt, he looks all the more serious.
(7) With the security guard near the gate no bad character could do anything illegal.
5. "with + object + adverb particle", e.g.:
(8) You cannot leave the machine there with electric power on.
(9) How can you lock the door with your guests in?
These five adverbial uses of the "with" structure are quite common, especially in scientific and technical English.

Next, the adjectival functions of the "with" structure, of which there are likewise five:
1. "with + object + present participle or participial phrase", e.g.:
(10) The body with a constant force acting on it moves at constant pace.
(11) Can you see the huge box with a long handle attaching to it?
2. "with + object + past participle or participial phrase", e.g.:
(12) Throw away the container with its cover sealed.
(13) Atoms with the outer layer filled with electrons do not form compounds.
3. "with + object + adjective or adjective phrase", e.g.:
(14) Put the documents in the filing container with all the drawers open.

The "with" complex structure

The absolute construction introduced by with/without

The preposition with/without + object + object complement can form an absolute (independent genitive) construction. All the types of absolute construction discussed above can appear in this structure.

A. with + noun/pronoun + adjective
He doesn't like to sleep with the windows open.
= He doesn't like to sleep when the windows are open.
He stood in the rain, with his clothes wet.
= He stood in the rain, and his clothes were wet.
Note: in the "with + noun/pronoun + adjective" pattern, adjectivized -ing or -ed forms may also be used.
With his son so disappointing, the old man felt unhappy.
With his father well-known, the boy didn't want to study.

B. with + noun/pronoun + adverb
Our school looks even more beautiful with all the lights on.
= Our school looks even more beautiful if/when all the lights are on.
The boy was walking, with his father ahead.
= The boy was walking and his father was ahead.

C. with + noun/pronoun + prepositional phrase
He stood at the door, with a computer in his hand. (or: He stood at the door, computer in hand.)
= He stood at the door, and a computer was in his hand.
Vincent sat at the desk, with a pen in his mouth. (or: Vincent sat at the desk, pen in mouth.)
= Vincent sat at the desk, and he had a pen in his mouth.

D. with + noun/pronoun + -ed form of the verb
With his homework done, Peter went out to play.
= When his homework was done, Peter went out to play.
With the signal given, the train started.
= After the signal was given, the train started.
I wouldn't dare go home without the job finished.
= I wouldn't dare go home because the job was not finished.

[Junior high English] Usage of "with"

[Basic uses of "with" and the absolute construction]

The with-structure is the most commonly used of the many English complex structures. Mastering it is of great help in learning the complex-object structure, the infinitive complex structure, the gerund complex structure, and the absolute (independent genitive) construction.

I. Formation of the with-structure
It consists of the preposition with or without followed by a complex object. The first part of the complex object is a noun or pronoun; the second part, the complement, is an adjective, an adverb, a prepositional phrase, an infinitive, or a participle (present or past). The possible patterns are:
1. with/without + noun/pronoun + adjective;
2. with/without + noun/pronoun + adverb;
3. with/without + noun/pronoun + prepositional phrase;
4. with/without + noun/pronoun + infinitive;
5. with/without + noun/pronoun + participle.
Examples of each:
1. She came into the room, with her nose red because of cold. (with + noun + adjective, adverbial of accompaniment)
2. With the meal over, we all went home. (with + noun + adverb, adverbial of time)
3. The master was walking up and down with the ruler under his arm. (with + noun + prepositional phrase, adverbial of accompaniment)
4. He could not finish it without me to help him. (without + pronoun + infinitive, adverbial of condition)
5. She fell asleep with the light burning. (with + noun + present participle, adverbial of accompaniment)

II. Usage of the with-structure
"with" is a preposition with many meanings, which can be hard to master all at once. To help sort them out, its uses are classified below, illustrated with sentences from the textbook and brief explanations. In a sentence the with-structure mostly serves as an adverbial, expressing manner, accompanying circumstance, time, cause, or condition (see the examples above).
1. "Holding, leading ..." (describing the manner of an action), e.g.:
Run with the kite like this.
2. "With, accompanied by ..." (describing a feature of a thing), e.g.:
A glass of apple juice, two glasses of coke, two hamburgers with potato chips, rice and fish.
3. "Together with (a person)":
a. doing something (living, eating, drinking, playing, talking ...) with someone, e.g.:
Now I am in China with my parents. Sometimes we go out to eat with our friends. He/She's talking with a friend.
b. used with go and come, meaning "to join" the other party, e.g.:
Do you want to come with me?
4. Used with play to form the phrasal verb play with, "play with ...", e.g.:
Two boys are playing with their yo-yos.
5. Used with help to form the pattern help ... with ..., "help (sb) with (sth)", e.g.:
On Monday and Wednesday, he helps his friends with their English.
6. Describing a facial expression: "with ..., wearing ...", e.g.:
"I'm late for school," said Sun Yang, with tears in his eyes.
7. Meaning "using ..."

Usage of the preposition "with"

Usage of the preposition "with"

1. Expressing a cooperative relationship between people: "together", "with"
go with sb
play with sb
live with sb
work with sb
make friends with sb
talk with sb = talk to sb
fight with sb
cooperate with sb
2. Meaning "having" or "with (a feature)"
tea with honey
a man with a lot of money
a house with a big garden
a chair with three legs
a girl with golden hair
3. Meaning "using" a tool or means
write with a pencil
cut the apple with a knife
4. Meaning "on one's person", "by one's side"
I don't have any money with me.
Take an umbrella with you in case it rains.
5. Meaning "under (someone's help)"
with the help of sb = with one's help
6. Meaning "along with"
with the development of ...
float with the wind

7. Common verb phrases with "with"
agree with sb/sth
deal with sth = do with sth (handle sth)
help sb with sth
fall in love with sb/sth
get on with sb
get on well with sb
have nothing to do with sb
compare A with B
communicate with sb
argue with sb = quarrel with sb
have fun with sth
get away with sth (do something bad without being punished)
chat with sb
charge sb with sth (accuse sb of sth)
put up with sth (tolerate sth)

8. Common adjective collocations with "with"
be satisfied with
be content with sth
be angry with sb
be strict with sb
be patient with sb
be popular with sb
be filled with sth = be full of sth
What's wrong/the matter with sb/sth
be familiar with sb/sth
be connected with sb/sth
be decorated with
be impressed with/by
