
Using a Corpus of Sentence Orderings Defined by Many Experts to Evaluate Metrics of Coherence for Text Structuring

Nikiforos Karamanis
Computational Linguistics Research Group, University of Wolverhampton, UK

Chris Mellish
Department of Computing Science, University of Aberdeen, UK

Abstract

This paper addresses two previously unresolved issues in the automatic evaluation of Text Structuring (TS) in Natural Language Generation (NLG). First, we describe how to verify the generality of an existing collection of sentence orderings defined by one domain expert using data provided by additional experts. Second, a general evaluation methodology is outlined which investigates the previously unaddressed possibility that there may exist many optimal solutions for TS in the employed domain. This methodology is implemented in a set of experiments which identify the most promising candidate for TS among several metrics of coherence previously suggested in the literature.¹

1 Introduction

Research in NLG focused on problems related to TS from very early on, [McKeown, 1985] being a classic example. Nowadays, TS continues to be an extremely fruitful field of diverse active research. In this paper, we assume the so-called search-based approach to TS [Karamanis et al., 2004], which employs a metric of coherence to select a text structure among various alternatives. The TS module is hypothesised to simply order a preselected set of information-bearing items such as sentences [Barzilay et al., 2002; Lapata, 2003; Barzilay and Lee, 2004] or database facts [Dimitromanolaki and Androutsopoulos, 2003; Karamanis et al., 2004].

Empirical work on the evaluation of TS has become increasingly automatic and corpus-based. As pointed out by [Karamanis, 2003; Barzilay and Lee, 2004] inter alia, using corpora for automatic evaluation is motivated by the fact that employing human informants in extended psycholinguistic experiments is often simply unfeasible. By contrast, large-scale automatic corpus-based experimentation takes place much more easily.

[Lapata, 2003] was the first to present an experimental setting which employs the distance between two orderings to estimate automatically how close a sentence ordering produced by her probabilistic TS model stands in comparison to orderings provided by several human judges.

[Dimitromanolaki and Androutsopoulos, 2003] derived sets of facts from the database of MPIRO, an NLG system that generates short descriptions of museum artefacts [Isard et al., 2003]. Each set consists of 6 facts, each of which corresponds to a sentence as shown in Figure 1. The facts in each set were manually assigned an order to reflect what a domain expert, i.e. an archaeologist trained in museum labelling, considered to be the most natural ordering of the corresponding sentences. Patterns of ordering facts were automatically learned from the corpus created by the expert. Then, a classification-based TS approach was implemented and evaluated in comparison to the expert’s orderings.

¹ Chapter 9 of [Karamanis, 2003] reports the study in more detail.

Database fact → Sentence

subclass(ex1, amph) → This exhibit is an amphora.
painted-by(ex1, p-Kleo) → This exhibit was decorated by the Painter of Kleofrades.
painter-story(p-Kleo, en4049) → The Painter of Kleofrades used to decorate big vases.
exhibit-depicts(ex1, en914) → This exhibit depicts a warrior performing splachnoscopy before leaving for the battle.
current-location(ex1, wag-mus) → This exhibit is currently displayed in the Martin von Wagner Museum.
museum-country(wag-mus, ger) → The Martin von Wagner Museum is in Germany.

Figure 1: MPIRO database facts corresponding to sentences

A subset of the corpus created by the expert in the previous study (to whom we will henceforth refer as E0) is employed by [Karamanis et al., 2004], who attempt to distinguish between many metrics of coherence with respect to their usefulness for TS in the same domain. Each human ordering of facts in the corpus is scored by each of these metrics, which are then penalised proportionally to the number of alternative orderings of the same material that are found to score equally to or better than the human ordering. The few metrics which manage to outperform two simple baselines in their overall performance across the corpus emerge as the most suitable candidates for TS in the investigated domain. This methodology is very similar to the way [Barzilay and Lee, 2004] evaluate their probabilistic TS model in comparison to the approach of [Lapata, 2003].

Because the data used in the studies of [Dimitromanolaki and Androutsopoulos, 2003] and [Karamanis et al., 2004] are based on the insights of just one expert, an obvious unresolved question is whether they reflect general strategies for ordering facts in the domain of interest. This paper addresses this issue by enhancing the dataset used in the two studies with orderings provided by three additional experts. These orderings are then compared with the orders of E0 using the methodology of [Lapata, 2003]. Since E0 is found to share a lot of common ground with two of her colleagues in the ordering task, her reliability is verified, while a fourth “stand-alone” expert who uses strategies not shared by any other expert is identified as well.

As in [Lapata, 2003], the same dependent variable which allows us to estimate how different the orders of E0 are from the orders of her colleagues is used to evaluate some of the metrics which perform best in [Karamanis et al., 2004]. As explained in the next section, in this way we investigate the previously unaddressed possibility that there may exist many optimal solutions for TS in our domain. The results of this additional evaluation experiment are presented and emphasis is laid on their relation with the previous findings.

Overall, this paper addresses two general issues: a) how to verify the generality of a dataset defined by one expert using sentence orderings provided by other experts and b) how to employ these data for the automatic evaluation of a TS approach. Given that the methodology discussed in this paper does not rely on the employed metrics of coherence or the assumed TS approach, our work can be of interest to any NLG researcher facing these questions.

The next section discusses how the methodology implemented in this study complements the methods of [Karamanis et al., 2004]. After briefly introducing the employed metrics of coherence, we describe the data collected for our experiments. Then, we present the employed dependent variable and formulate our predictions. In the results section, we state which of these predictions were verified. The paper is concluded with a discussion of the main findings.

2 An additional evaluation test

As [Barzilay et al., 2002] report, different humans often order sentences in distinct ways. Thus, there might exist more than one equally good solution for TS, a view shared by almost all TS researchers, but which has not been accounted for in the evaluation methodologies of [Karamanis et al., 2004] and [Barzilay and Lee, 2004].²

Collecting sentence orderings defined by many experts in our domain enables us to investigate the possibility that there might exist many good solutions for TS. Then, the measure of [Lapata, 2003], which estimates how close two orderings stand, can be employed not only to verify the reliability of E0 but also to compare the orderings preferred by the assumed TS approach with the orderings of the experts.

However, this evaluation methodology has its limitations as well. Being engaged in other obligations, the experts normally have just a limited amount of time to devote to the NLG researcher. Similarly to standard psycholinguistic experiments, consulting these informants is difficult to extend to a larger corpus like the one used e.g. by [Karamanis et al., 2004] (122 sets of facts).

² A more detailed discussion of existing corpus-based methods for evaluating TS appears in [Karamanis and Mellish, 2005].

In this paper, we reach a reasonable compromise by showing how the methodology of [Lapata, 2003] supplements the evaluation efforts of [Karamanis et al., 2004] using a similar (yet by necessity smaller) dataset. Clearly, a metric of coherence that has already done well in the previous study gains an extra bonus by passing this additional test.

3 Metrics of coherence

[Karamanis, 2003] discusses how a few basic notions of coherence captured by Centering Theory (CT) can be used to define a large range of metrics which might be useful for TS in our domain of interest.³ The metrics employed in the experiments of [Karamanis et al., 2004] include:

M.NOCB, which penalises NOCBs, i.e. pairs of adjacent facts without any arguments in common [Karamanis and Manurung, 2002]. Because of its simplicity, M.NOCB serves as the first baseline in the experiments of [Karamanis et al., 2004].

PF.NOCB, a second baseline, which enhances M.NOCB with a global constraint on coherence that [Karamanis, 2003] calls the PageFocus (PF).

PF.BFP, which is based on PF as well as the original formulation of CT in [Brennan et al., 1987].

PF.KP, which makes use of PF as well as the recent reformulation of CT in [Kibble and Power, 2000].

[Karamanis et al., 2004] report that PF.NOCB outperformed M.NOCB but was overtaken by PF.BFP and PF.KP. The two metrics beating PF.NOCB were not found to differ significantly from each other.

This study employs PF.BFP and PF.KP, i.e. two of the best performing metrics of the experiments in [Karamanis et al., 2004], as well as M.NOCB and PF.NOCB, the two previously used baselines. An additional random baseline is also defined following [Lapata, 2003].

4 Data collection

16 sets of facts were randomly selected from the corpus of [Dimitromanolaki and Androutsopoulos, 2003].⁴ The sentences that each fact corresponds to and the order defined by E0 were made available to us as well. We will subsequently refer to an unordered set of facts (or sentences that the facts correspond to) as a Testitem.

4.1 Generating the BestOrders for each metric

Following [Karamanis et al., 2004], we envisage a TS approach in which a metric of coherence M assigns a score to each possible ordering of the input set of facts and selects the best scoring ordering as the output. When many orderings score best, M chooses randomly between them. Crucially, our hypothetical TS component only considers orderings starting with the subclass fact (e.g. subclass(ex1, amph) in Figure 1) following the suggestion of [Dimitromanolaki and Androutsopoulos, 2003]. This gives rise to 5! = 120 orderings to be scored by M for each Testitem.

³ Since discussing the metrics in detail is well beyond the scope of this paper, the reader is referred to Chapter 3 of [Karamanis, 2003] for more information on this issue.

⁴ These are distinct from, yet very similar to, the sets of facts used in [Karamanis et al., 2004].

For the purposes of this experiment, a simple algorithm was implemented that first produces the 120 possible orderings of facts in a Testitem and subsequently ranks them according to the scores given by M. The algorithm outputs the set of BestOrders for the Testitem, i.e. the orderings which score best according to M. This procedure was repeated for each metric and all Testitems employed in the experiment.

4.2 Random baseline

Following [Lapata, 2003], a random baseline (RB) was implemented as the lower bound of the analysis. The random baseline consists of 10 randomly selected orderings for each Testitem. The orderings are selected irrespective of their scores for the various metrics.
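The procedures of Sections 4.1 and 4.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the fact representation and all function names are hypothetical, the penalty is a simplified NOCB count (adjacent facts sharing no argument) rather than the full M.NOCB metric, and the random baseline is assumed to sample without replacement from orderings that start with the subclass fact.

```python
import itertools
import random

# Facts of Figure 1 as (predicate, arguments) pairs (hypothetical representation).
FACTS = [
    ("subclass", ("ex1", "amph")),
    ("painted-by", ("ex1", "p-Kleo")),
    ("painter-story", ("p-Kleo", "en4049")),
    ("exhibit-depicts", ("ex1", "en914")),
    ("current-location", ("ex1", "wag-mus")),
    ("museum-country", ("wag-mus", "ger")),
]


def candidate_orderings(facts):
    """Return all orderings that start with the subclass fact (5! = 120 for six facts)."""
    first = next(f for f in facts if f[0] == "subclass")
    rest = [f for f in facts if f is not first]
    return [(first, *perm) for perm in itertools.permutations(rest)]


def nocb_count(ordering):
    """Count adjacent fact pairs with no argument in common (simplified NOCB penalty)."""
    return sum(
        1 for a, b in zip(ordering, ordering[1:])
        if not set(a[1]) & set(b[1])
    )


def best_orders(facts, penalty):
    """Return every candidate ordering with the lowest penalty (the BestOrders)."""
    candidates = candidate_orderings(facts)
    lowest = min(penalty(o) for o in candidates)
    return [o for o in candidates if penalty(o) == lowest]


def random_baseline(facts, k=10, seed=0):
    """Draw k candidate orderings at random, ignoring any metric score (assumed sampling scheme)."""
    rng = random.Random(seed)
    return rng.sample(candidate_orderings(facts), k)


best = best_orders(FACTS, nocb_count)
rb = random_baseline(FACTS)
print(len(candidate_orderings(FACTS)), len(best), len(rb))
```

Returning the whole set of tied orderings, rather than one ordering picked at random, mirrors the way the BestOrders are later averaged over when a metric is compared with an expert.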

4.3 Consulting domain experts

Three archaeologists (E1, E2, E3), one male and two females, between 28 and 45 years of age, all trained in cataloguing and museum labelling, were recruited from the Department of Classics at the University of Edinburgh.

Each expert was consulted by the first author in a separate interview. First, she was presented with a set of six sentences, each of which corresponded to a database fact and was printed on a different filecard, as well as with written instructions describing the ordering task.⁵ The instructions mention that the sentences come from a computer program that generates descriptions of artefacts in a virtual museum. The first sentence for each set was given by the experimenter.⁶ Then, the expert was asked to order the remaining five sentences in a coherent text.

When ordering the sentences, the expert was instructed to consider which ones should be together and which should come before another in the text without using hints other than the sentences themselves. She could revise her ordering at any time by moving the sentences around. When she was satisfied with the ordering she produced, she was asked to write next to each sentence its position, and give the filecards to the experimenter in order to perform the same task with the next randomly selected set of sentences. The expert was encouraged to comment on the difficulty of the task, the strategies she followed, etc.

5 Dependent variable

Given an unordered set of sentences and two possible orderings, a number of measures can be employed to calculate the distance between them. Based on the argumentation in [Howell, 2002], [Lapata, 2003] selects Kendall’s τ as the most appropriate measure and this was what we used for our analysis as well. Kendall’s τ is based on the number of inversions between the two orderings and is calculated as follows:

(1) $\tau = 1 - \frac{2I}{P_N} = 1 - \frac{2I}{N(N-1)/2}$

$P_N$ stands for the number of pairs of sentences and $N$ is the number of sentences to be ordered.⁷ $I$ stands for the number of inversions, that is, the number of adjacent transpositions necessary to bring one ordering to another. Kendall’s τ ranges from −1 (inverse ranks) to 1 (identical ranks). The higher the τ value, the smaller the distance between the two orderings. Following [Lapata, 2003], the Tukey test is employed to investigate significant differences between average τ scores.⁸

First, the average distance between (the orderings of)⁹ two experts, e.g. E0 and E1, denoted as T(E0E1), is calculated as the mean τ value between the ordering of E0 and the ordering of E1 taken across all 16 Testitems. Then, we compute T(EXP EXP), which expresses the overall average distance between all expert pairs and serves as the upper bound for the evaluation of the metrics. Since a total of $E$ experts gives rise to $P_E = E(E-1)/2$ expert pairs, T(EXP EXP) is computed by summing up the average distances between all expert pairs and dividing the sum by $P_E$.

⁵ The instructions are given in Appendix D of [Karamanis, 2003] and are adapted from the ones used in [Barzilay et al., 2002].

⁶ This is the sentence corresponding to the subclass fact.

While [Lapata, 2003] always appears to single out a unique best scoring ordering, we often have to deal with many best scoring orderings. To account for this, we first compute the average distance between e.g. the ordering of an expert E0 and the BestOrders of a metric M for a given Testitem. In this way, M is rewarded for a BestOrder that is close to the expert’s ordering, but penalised for every BestOrder that is not. Then, the average T(E0M) between the expert E0 and the metric M is calculated as their mean distance across all 16 Testitems. Finally, yet most importantly, T(EXP M) is the average distance between all experts and M. It is calculated by summing up the average distances between each expert and M and dividing the sum by the number of experts. As the next section explains in more detail, T(EXP M) is compared with the upper bound of the evaluation T(EXP EXP) to estimate the performance of M in our experiments.

RB is evaluated in a similar way as M using the 10 randomly selected orderings instead of the BestOrders for each Testitem. T(EXP RB) is the average distance between all experts and RB and is used as the lower bound of the evaluation.
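The dependent variable described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions and not the scripts used in the study: the function names are hypothetical, orderings are assumed to be sequences of the same distinct, hashable items, and inversions are counted as pairs of items that the two orderings place in opposite relative order (which equals the number of adjacent transpositions needed to turn one ordering into the other).

```python
from itertools import combinations


def inversions(order_a, order_b):
    """Count item pairs that the two orderings place in opposite relative order."""
    position_in_b = {item: i for i, item in enumerate(order_b)}
    ranks = [position_in_b[item] for item in order_a]
    return sum(1 for x, y in combinations(ranks, 2) if x > y)


def kendall_tau(order_a, order_b):
    """Kendall's tau as in (1): 1 - 2I / (N(N-1)/2)."""
    n = len(order_a)
    pairs = n * (n - 1) / 2
    return 1 - 2 * inversions(order_a, order_b) / pairs


def tau_to_order_set(expert_order, orders):
    """Average tau between one expert ordering and a set of orderings for one Testitem.

    `orders` can be the BestOrders of a metric or the random baseline orderings.
    """
    return sum(kendall_tau(expert_order, o) for o in orders) / len(orders)


def mean_distance(expert_orders, order_sets):
    """T(E M): mean of the per-Testitem averages across all Testitems."""
    per_item = [tau_to_order_set(e, s) for e, s in zip(expert_orders, order_sets)]
    return sum(per_item) / len(per_item)


# Toy usage with six items, as in the data (N = 6).
expert = list("abcdef")
one_swap = list("bacdef")        # one adjacent transposition
reverse = list(reversed(expert))  # inverse ranks
print(kendall_tau(expert, expert), kendall_tau(expert, reverse))
```

Averaging over every tied ordering rewards a metric for each BestOrder close to the expert and penalises it for each one that is not, as described above.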

⁷ In our data, N is always equal to 6.

⁸ Provided that an omnibus ANOVA is significant, the Tukey test can be used to specify which of the conditions c_1, ..., c_n measured by the dependent variable differ significantly. It uses the set of means m_1, ..., m_n (corresponding to conditions c_1, ..., c_n) and the mean square error of the scores that contribute to these means to calculate a critical difference between any two means. An observed difference between any two means is significant if it exceeds the critical difference.

⁹ Throughout the paper we often refer to e.g. “the distance between the orderings of the experts” with the phrase “the distance between the experts” for the sake of brevity.

                E0E2    E1E2    E0E3    E1E3    E2E3
                0.717   0.758   0.258   0.300   0.192
E0E1   0.692                    **      **      **
E0E2   0.717                    **      **      **
E1E2   0.758                    **      **      **
E0E3   0.258
E1E3   0.300

CD at 0.01: 0.338
CD at 0.05: 0.282
F(5,75) = 14.931, p < 0.000

Table 1: Comparison of distances between the expert pairs

6 Predictions

Despite any potential differences between the experts, one expects them to share some common ground in the way they order sentences. In this sense, a particularly welcome result for our purposes is to show that the average distances between E0 and most of her colleagues are short and not significantly different from the distances between the other expert pairs, which in turn indicates that she is not a “stand-alone” expert.

Moreover, we expect the average distance between the expert pairs to be significantly smaller than the average distance between the experts and RB. This is again based on the assumption that even though the experts might not follow completely identical strategies, they do not operate with absolute diversity either. Hence, we predict that T(EXP EXP) will be significantly greater than T(EXP RB).

Due to the small number of Testitems employed in this study, it is likely that the metrics do not differ significantly from each other with respect to their average distance from the experts. Rather than comparing the metrics directly with each other (as [Karamanis et al., 2004] do), this study compares them indirectly by examining their behaviour with respect to the upper and the lower bound. For instance, although T(EXP PF.KP) and T(EXP PF.BFP) might not be significantly different from each other, one score could be significantly different from T(EXP EXP) (upper bound) and/or T(EXP RB) (lower bound) while the other is not.

We identify the best metrics in this study as the ones whose average distance from the experts (i) is significantly greater than the lower bound and (ii) does not differ significantly from the upper bound.¹⁰

7 Results

7.1 Distances between the expert pairs

In the first step of our analysis, we computed the T score for each expert pair, namely T(E0E1), T(E0E2), T(E0E3), T(E1E2), T(E1E3) and T(E2E3). Then we performed all 15 pairwise comparisons between them using the Tukey test, the results of which are summarised in Table 1.¹¹

The cells in the Table report the level of significance returned by the Tukey test when the difference between two distances exceeds the critical difference (CD). Significance beyond the 0.05 threshold is reported with one asterisk (*), while significance beyond the 0.01 threshold is reported with two asterisks (**). A cell remains empty when the difference between two distances does not exceed the critical difference. For example, the value of T(E0E1) is 0.692 and the value of T(E0E3) is 0.258. Since their difference exceeds the CD at the 0.01 threshold, it is reported to be significant beyond that level by the Tukey test, as shown in the top cell of the third column in Table 1.

¹⁰ Criterion (ii) can only be applied provided that the average distance between the experts and at least one metric M_x is found to be significantly lower than T(EXP EXP). Then, if the average distance between the experts and another metric M_y does not differ significantly from T(EXP EXP), M_y performs better than M_x.

¹¹ The Table also reports the result of the omnibus ANOVA, which is significant: F(5,75) = 14.931, p < 0.000.

                E0E2    E1E2    E0RB    E1RB    E2RB
                0.717   0.758   0.323   0.347   0.352
E0E1   0.692                    **      **      **
E0E2   0.717                    **      **      **
E1E2   0.758                    **      **      **
E0RB   0.323
E1RB   0.347

CD at 0.01: 0.242
CD at 0.05: 0.202
F(5,75) = 18.762, p < 0.000

                E1E3    E2E3    E3RB
                0.300   0.192   0.302
E0E3   0.258
E1E3   0.300
E2E3   0.192

CD at 0.01: 0.219
CD at 0.05: 0.177
F(3,45) = 1.223, p = 0.312

Table 2: Comparison of distances between the experts (E0, E1, E2, E3) and the random baseline (RB)

As the Table shows, the T scores for the distance between E0 and E1 or E2, i.e. T(E0E1) and T(E0E2), as well as the T for the distance between E1 and E2, i.e. T(E1E2), are quite high, which indicates that on average the orderings of the three experts are quite close to each other. Moreover, these T scores are not significantly different from each other, which suggests that E0, E1 and E2 share quite a lot of common ground in the ordering task. Hence, E0 is found to give rise to similar orderings to the ones of E1 and E2.

However, when any of the previous distances is compared with a distance that involves the orderings of E3, the difference is significant, as shown by the cells containing two asterisks in Table 1. In other words, although the orderings of E1 and E2 seem to deviate from each other and from the orderings of E0 to more or less the same extent, the orderings of E3 stand much further away from all of them. Hence, there exists a “stand-alone” expert among the ones consulted in our studies, yet this is not E0 but E3.

This finding can be easily explained by the fact that, by contrast to the other three experts, E3 followed a very schematic way for ordering sentences. Because the orderings of E3 manifest rather peculiar strategies, at least compared to the orderings of E0, E1 and E2, the upper bound of the analysis, i.e. the average distance between the expert pairs T(EXP EXP), is computed without taking into account these orderings:

(2) $T(\text{EXP EXP}) = \frac{T(\text{E0E1}) + T(\text{E0E2}) + T(\text{E1E2})}{3} = 0.722$

7.2 Distances between the experts and RB

As the upper part of Table 2 shows, the T score between any two experts other than E3 is significantly greater than their distance from RB beyond the 0.01 threshold. Only the distances between E3 and another expert, shown in the lower section of Table 2, are not significantly different from the distance between E3 and RB.

Although this result does not mean that the orders of E3 are similar to the orders of RB,¹² it shows that E3 is roughly as far away from e.g. E0 as she is from RB. By contrast, E0 stands significantly closer to E1 than to RB, and the same holds for the other distances in the upper part of the Table. In accordance with the discussion in the previous section, the lower bound, i.e. the overall average distance between the experts (excluding E3) and RB, T(EXP RB), is computed as shown in (3):

(3) $T(\text{EXP RB}) = \frac{T(\text{E0RB}) + T(\text{E1RB}) + T(\text{E2RB})}{3} = 0.341$
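As a quick arithmetic check, the bounds in (2) and (3) can be reproduced from the pairwise scores reported in Tables 1 and 2. The snippet below is only a sketch: the variable names are hypothetical, and it averages the rounded three-decimal values from the tables rather than the unrounded scores used in the study.

```python
# Pairwise T scores as reported in Tables 1 and 2 (E3 excluded).
expert_pairs = {"E0E1": 0.692, "E0E2": 0.717, "E1E2": 0.758}
expert_rb = {"E0RB": 0.323, "E1RB": 0.347, "E2RB": 0.352}


def mean(values):
    """Arithmetic mean of a collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


upper_bound = mean(expert_pairs.values())  # T(EXP EXP)
lower_bound = mean(expert_rb.values())     # T(EXP RB)
print(f"{upper_bound:.3f} {lower_bound:.3f}")  # prints 0.722 0.341
```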

7.3 Distances between the experts and each metric

So far, E3 was identified as a “stand-alone” expert standing further away from the other three experts than they stand from each other. We also identified the distance between E3 and each expert as similar to her distance from RB.

Similarly, E3 was found to stand further away from the metrics compared to their distance from the other three experts.¹³ This result gives rise to the set of formulas in (4) for calculating the overall average distance between the experts (excluding E3) and each metric.

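(4)
(4.1) $T(\text{EXP PF.BFP}) = \frac{T(\text{E0 PF.BFP}) + T(\text{E1 PF.BFP}) + T(\text{E2 PF.BFP})}{3} = 0.629$

(4.2) $T(\text{EXP PF.KP}) = \frac{T(\text{E0 PF.KP}) + T(\text{E1 PF.KP}) + T(\text{E2 PF.KP})}{3} = 0.571$

(4.3) $T(\text{EXP PF.NOCB}) = \frac{T(\text{E0 PF.NOCB}) + T(\text{E1 PF.NOCB}) + T(\text{E2 PF.NOCB})}{3} = 0.606$

(4.4) $T(\text{EXP M.NOCB}) = \frac{T(\text{E0 M.NOCB}) + T(\text{E1 M.NOCB}) + T(\text{E2 M.NOCB})}{3} = 0.487$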

In the next section, we present the concluding analysis for this study, which compares the overall distances in formulas (2), (3) and (4) with each other. As we have already mentioned, T(EXP EXP) serves as the upper bound of the analysis whereas T(EXP RB) is the lower bound. The aim is to specify which scores in (4) are significantly greater than T(EXP RB), but not significantly lower than T(EXP EXP).

7.4 Concluding analysis

The results of the comparisons of the scores in (2), (3) and (4) are shown in Table 3. As the top cell in the last column of the Table shows, the T score between the experts and RB, T(EXP RB), is significantly lower than the average distance between the expert pairs, T(EXP EXP), at the 0.01 level. This result verifies one of our main predictions, showing that the orderings of the experts (modulo E3) stand much closer to each other compared to their distance from randomly assembled orderings.

¹² This could have been argued if the value of T(E3RB) had been much closer to 1.

¹³ Due to space restrictions, we cannot report the scores for these comparisons here. The reader is referred to Table 9.4 on page 175 of Chapter 9 in [Karamanis, 2003].

As expected, most of the scores that involve the metrics are not significantly different from each other, except for T(EXP PF.BFP), which is significantly greater than T(EXP M.NOCB) at the 0.05 level. Yet, what we are mainly interested in is how the distance between the experts and each metric compares with T(EXP EXP) and T(EXP RB). This is shown in the first row and the last column of Table 3.

Crucially, T(EXP RB) is significantly lower than T(EXP PF.BFP) as well as T(EXP PF.NOCB) and T(EXP PF.KP) at the 0.01 level. Notably, even the distance of the experts from M.NOCB, T(EXP M.NOCB), is significantly greater than T(EXP RB), albeit at the 0.05 level. These results show that the distance from the experts is significantly reduced when using the best scoring orderings of any metric, even M.NOCB, instead of the orderings of RB. Hence, all metrics score significantly better than RB in this experiment.

However, simply using M.NOCB to output the best scoring orders is not enough to yield a distance from the experts which is comparable to T(EXP EXP). Although the PF constraint appears to help in this direction, T(EXP PF.KP) remains significantly lower than T(EXP EXP), whereas the difference between T(EXP PF.NOCB) and T(EXP EXP) falls only 0.009 points short of the CD at the 0.05 threshold. Hence, PF.BFP is the most robust metric, as the difference between T(EXP PF.BFP) and T(EXP EXP) is clearly not significant.

Finally, the difference between T(EXP PF.NOCB) and T(EXP M.NOCB) is only 0.006 points away from the CD at the 0.05 threshold. This result shows that the distance from the experts is reduced to a great extent when the best scoring orderings are computed according to PF.NOCB instead of simply M.NOCB. Hence, this experiment provides additional evidence in favour of enhancing M.NOCB with the PF constraint of coherence, as suggested in [Karamanis, 2003].
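The comparisons discussed in this section can be retraced with a short Python sketch that checks each pairwise difference against the critical differences reported in Table 3. This is an illustration only: the names are hypothetical, and it works with the rounded three-decimal scores from the table, whereas the Tukey test in the study was run on the unrounded data.

```python
# Overall T scores and critical differences as reported in Table 3.
SCORES = {
    "EXP EXP": 0.722,
    "PF.BFP": 0.629,
    "PF.NOCB": 0.606,
    "PF.KP": 0.571,
    "M.NOCB": 0.487,
    "RB": 0.341,
}
CD_01 = 0.150
CD_05 = 0.125
METRICS = ["PF.BFP", "PF.NOCB", "PF.KP", "M.NOCB"]


def significance(a, b):
    """Return '**', '*' or '' depending on which critical difference is exceeded."""
    diff = abs(SCORES[a] - SCORES[b])
    if diff > CD_01:
        return "**"
    if diff > CD_05:
        return "*"
    return ""


# Criterion (i): significantly greater than the lower bound (RB).
# Criterion (ii): not significantly different from the upper bound (EXP EXP).
passing = [
    m for m in METRICS
    if significance(m, "RB") and not significance(m, "EXP EXP")
]

for m in METRICS:
    print(m, repr(significance(m, "EXP EXP")), repr(significance(m, "RB")))
print(passing)
```

On these rounded values both PF.BFP and PF.NOCB satisfy the two criteria, but PF.NOCB does so only narrowly, since its difference from the upper bound (0.116) is just 0.009 below the CD at the 0.05 threshold. This is why the discussion singles out PF.BFP as the most robust metric.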

8 Discussion

A question not addressed by previous studies making use of a certain collection of orderings of facts is whether the strategies reflected there are specific to E0, the expert who created the dataset. In this paper, we address this question by enhancing E0’s dataset with orderings provided by three additional experts. Then, the distance between E0 and her colleagues is computed and compared to the distance between the other expert pairs. The results indicate that E0 shares a lot of common ground with two of her colleagues in the ordering task, deviating from them as much as they deviate from each other, while the orderings of a fourth “stand-alone” expert are found to manifest rather individualistic strategies.

The same variable used to investigate the distance between the experts is employed to automatically evaluate the best scoring orderings of some of the best performing metrics in [Karamanis et al., 2004]. Despite its limitations due to the necessarily restricted size of the employed dataset, this evaluation task allows us to explore the previously unaddressed possibility that there exist many good solutions for TS in the employed domain.

                      EXP PF.BFP   EXP PF.NOCB  EXP PF.KP    EXP M.NOCB   EXP RB
                      0.629        0.606        0.571        0.487        0.341
EXP EXP      0.722                              **           **           **
EXP PF.BFP   0.629                                           *            **
EXP PF.NOCB  0.606                                                        **
EXP PF.KP    0.571                                                        **
EXP M.NOCB   0.487                                                        *

CD at 0.01: 0.150
CD at 0.05: 0.125
F(5,75) = 19.111, p < 0.000

Table 3: Results of the concluding analysis comparing the distance between the expert pairs (EXP EXP) with the distance between the experts and each metric (PF.BFP, PF.NOCB, PF.KP, M.NOCB) and the random baseline (RB)

Out of a much larger set of possibilities, 10 metrics were evaluated in [Karamanis et al., 2004], only a handful of which were found to overtake two simple baselines. The additional test in this study carries on the elimination process by pointing out PF.BFP as the single most promising metric to be used for TS in the explored domain, since this is the metric that manages to clearly survive both tests.

Equally crucially, our analysis shows that all employed metrics are superior to a random baseline. Additional evidence in favour of the PF constraint on coherence introduced in [Karamanis, 2003] is provided as well. The general evaluation methodology as well as the specific results of this study will be useful for any subsequent attempt to automatically evaluate a TS approach using a corpus of sentence orderings defined by many experts.

As [Reiter and Sripada, 2002] suggest, the best way to treat the results of a corpus-based study is as hypotheses which eventually need to be integrated with other types of evaluation. Although we followed the ongoing argumentation that using perceptual experiments to choose between many possible metrics is unfeasible, our efforts have resulted in a single preferred candidate which is much easier to evaluate with the help of psycholinguistic techniques (instead of having to deal with a large number of metrics from very early on). This is indeed our main direction for future work in this domain.

Acknowledgments

We are grateful to Aggeliki Dimitromanolaki for entrusting us with her data and for helpful clarifications on their use; to Mirella Lapata for providing us with the scripts for the computation of τ together with her extensive and prompt advice; to Katerina Kolotourou for her invaluable assistance in recruiting the experts; and to the experts for their participation. This work took place while the first author was studying at the University of Edinburgh, supported by the Greek State Scholarship Foundation (IKY).

References

[Barzilay and Lee,2004]Regina Barzilay and Lillian Lee.Catch-ing the drift:Probabilistic content models with applications to generation and summarization.In Proceedings of HLT-NAACL 2004,pages113–120,2004.

[Barzilay et al.,2002]Regina Barzilay,Noemie Elhadad,and Kathleen McKeown.Inferring strategies for sentence ordering

in multidocument news summarization.Journal of Arti?cial In-telligence Research,17:35–55,2002.

[Brennan et al.,1987]Susan E.Brennan,Marilyn A.Fried-man[Walker],and Carl J.Pollard.A centering approach to pro-nouns.In Proceedings of ACL1987,pages155–162,Stanford, California,1987.

[Dimitromanolaki and Androutsopoulos,2003]Aggeliki Dimitro-manolaki and Ion Androutsopoulos.Learning to order facts for discourse planning in natural language generation.In Proceed-ings of the9th European Workshop on Natural Language Gener-ation,Budapest,Hungary,2003.

[Howell,2002]David C.Howell.Statistical Methods for Psychol-ogy.Duxbury,Paci?c Grove,CA,5th edition,2002.

[Isard et al.,2003]Amy Isard,Jon Oberlander,Ion Androutsopou-los,and Colin Matheson.Speaking the users’languages.IEEE Intelligent Systems Magazine,18(1):40–45,2003. [Karamanis and Manurung,2002]Nikiforos Karamanis and Hisar Maruli Manurung.Stochastic text structuring using the principle of continuity.In Proceedings of INLG2002,pages 81–88,Harriman,NY,USA,July2002.

[Karamanis and Mellish,2005]Nikiforos Karamanis and Chris Mellish.A review of recent corpus-based methods for evaluat-ing text structuring in NLG.2005.Submitted to Using Corpora for NLG workshop.

[Karamanis et al.,2004]Nikiforos Karamanis,Chris Mellish,Jon Oberlander,and Massimo Poesio.A corpus-based methodology for evaluating metrics of coherence for text structuring.In Pro-ceedings of INLG04,pages90–99,Brockenhurst,UK,2004. [Karamanis,2003]Nikiforos Karamanis.Entity Coherence for De-scriptive Text Structuring.PhD thesis,Division of Informatics, University of Edinburgh,2003.

[Kibble and Power,2000]Rodger Kibble and Richard Power.An integrated framework for text planning and pronominalisation.In Proceedings of INLG2000,pages77–84,Israel,2000. [Lapata,2003]Mirella Lapata.Probabilistic text structuring:Ex-periments with sentence ordering.In Proceedings of ACL2003, pages545–552,Saporo,Japan,July2003.

[McKeown, 1985] Kathleen McKeown. Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Studies in Natural Language Processing. Cambridge University Press, 1985.

[Reiter and Sripada, 2002] Ehud Reiter and Somayajulu Sripada. Should corpora texts be gold standards for NLG? In Proceedings of INLG 2002, pages 97–104, Harriman, NY, USA, July 2002.
