How does Lucene scoring work?


I've been working with Hibernate Search for months now, but I'm still not able to fully digest the relevance scores it returns.

Overall I'm satisfied with the results it returns, but even the simplest test does not meet my expectations. I'm really confused by this scoring effect. My query is quite complex, but since this test does not involve any other field, it can be simplified to a single BooleanJunction.

Scoring calculation is really complex. You have to begin with the main equation, Lucene's practical scoring function. As you said, tf means term frequency, and its value is the square root of the term's frequency in the field. But, as you can see in your explanation output, you also have norm (a.k.a. fieldNorm), which is used in the fieldWeight calculation. Let's take your example: eklavya has a better score than the other document because fieldWeight is the product of tf, idf and fieldNorm, and this last factor is higher for the eklavya document because its field contains only one term.
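For reference, the fieldWeight line you see in the explain output is exactly that product under DefaultSimilarity; the numbers below are purely illustrative and not taken from the question's output:

```
fieldWeight(t, d) = tf(t in d) · idf(t) · fieldNorm(t, d)

tf(t in d) = sqrt(frequency of t in d)
fieldNorm  ≈ 1 / sqrt(number of terms in the field)    (boosts and one-byte rounding aside)

field containing only "eklavya"  : fieldNorm = 1 / sqrt(1) = 1.0
field containing four terms      : fieldNorm = 1 / sqrt(4) = 0.5
→ with equal tf and idf, the one-term field gets twice the fieldWeight
```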

One disadvantage of this approach is that the multiplier function is linear and does not work very well when contentAge is 0. Another possible problem is that the maximum of the multiplier function can become too large, making the default score irrelevant. An alternative approach is to add a constant to the formula, where the constant can be any number, depending on how much we want to boost new results.

For example, 2 does the job relatively well. Adding a constant keeps the boosting function linear, but it boosts recent items more aggressively relative to older results.

To address the tendency of the boosting function to behave too linearly, you can use more than one constant and introduce variables such as boostFactor, maxRampFactor and curveAdjustmentFactor. For example, a function that gets the job done well could look like the sketch below.
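The article's concrete formula is not reproduced here, so purely as an illustration of the shape such a function might take (the constant names come from the text; the exponential arrangement is an assumption, chosen to match the "exponential boosting function" mentioned later):

```
boost(contentAge) = 1 + boostFactor · maxRampFactor · e^(-contentAge / curveAdjustmentFactor)
```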

To understand better how to fine-tune these constants to fit your preferences, it helps to plot the boosting formula with different constant values and compare the resulting curves.

To create a custom score query, you must start by adding a new class which inherits from the Lucene CustomScoreProvider class. This provider is responsible for the search score logic.

Inside the new class you must override the CustomScore method. This method gives you access to the Lucene document and the default score, which you can obtain by calling the base class method. From the document object you can extract the LastModified field value and use it to determine the document's age in days.
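Here is a minimal sketch of such a provider using the Java Lucene API (roughly 5.x/6.x, where the class lives in org.apache.lucene.queries and the method is spelled customScore). The LastModified field name comes from the text; the class name, the epoch-millisecond date representation and the calculateBoost helper are illustrative assumptions:

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.queries.CustomScoreProvider;

public class RecencyScoreProvider extends CustomScoreProvider {

    public RecencyScoreProvider(LeafReaderContext context) {
        super(context);
    }

    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScore) throws IOException {
        // The base class method yields the default score for this document.
        float defaultScore = super.customScore(doc, subQueryScore, valSrcScore);

        // Load the stored fields of the matching document from this segment.
        Document document = context.reader().document(doc);

        // Assumption: "LastModified" is a stored field holding epoch milliseconds.
        long lastModified = Long.parseLong(document.get("LastModified"));
        long ageInDays = (System.currentTimeMillis() - lastModified) / (1000L * 60 * 60 * 24);

        return defaultScore * calculateBoost(ageInDays);
    }

    // Placeholder; a possible implementation is sketched further below.
    private float calculateBoost(long contentAgeInDays) {
        return 1.0f;
    }
}
```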

Now that you have access to the content age and the default score, you can implement your desired custom scoring logic. For example, to implement an exponential boosting function with several constants, as described earlier in this article, you can add a method to your custom provider called CalculateBoost.

You can call this method from the CustomScore method, passing the calculated content age as a parameter. Inside CalculateBoost you calculate a boost value based on the additional constants you define and the content-age input. Once you have completed implementing the custom score provider, you must add a new class that inherits from the Lucene CustomScoreQuery class.
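A possible implementation of the helper, added to the provider sketched above; the constant values and the exponential-decay arrangement are illustrative assumptions, not the article's exact formula:

```java
// Rough shape: about (1 + BOOST_FACTOR * MAX_RAMP_FACTOR) for brand-new content,
// decaying towards 1 as the content gets older.
private static final float BOOST_FACTOR            = 1.0f;  // overall strength of the boost
private static final float MAX_RAMP_FACTOR         = 2.0f;  // extra multiplier at age 0
private static final float CURVE_ADJUSTMENT_FACTOR = 30.0f; // decay horizon in days

private float calculateBoost(long contentAgeInDays) {
    double ramp = MAX_RAMP_FACTOR * Math.exp(-contentAgeInDays / CURVE_ADJUSTMENT_FACTOR);
    return 1.0f + BOOST_FACTOR * (float) ramp;
}
```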

Inside this class you must override the GetCustomScoreProvider method, which instructs Lucene which provider to use when determining the search score; a sketch follows after the key points below.

The following factors are available in the scoring algorithm: tf (term frequency), a measure of how often a term appears in the document.

Key points: scoring is very much dependent on the way documents are indexed, and a CustomScoreQuery is much less demanding to implement than an entire custom Query.
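A minimal sketch of the query class, again assuming the Java Lucene API (org.apache.lucene.queries.CustomScoreQuery, roughly 5.x/6.x); the class name is illustrative:

```java
import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.queries.CustomScoreProvider;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.Query;

public class RecencyBoostQuery extends CustomScoreQuery {

    public RecencyBoostQuery(Query subQuery) {
        super(subQuery);
    }

    @Override
    protected CustomScoreProvider getCustomScoreProvider(LeafReaderContext context) throws IOException {
        // Use the recency-aware provider for each index segment.
        return new RecencyScoreProvider(context);
    }
}
```

You would then wrap your original query, for example new RecencyBoostQuery(originalQuery), and hand the wrapper to the IndexSearcher as usual.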

The formula used for scoring is known as the practical scoring function.

It is also assumed that readers know how to use the Searcher. In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (tokenized or not, stored or not, and so on). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with exactly the same content, but one having the content in two Fields and the other in one Field, will return different scores for the same query due to length normalization (assuming the DefaultSimilarity on the Fields).
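As a concrete illustration of that Field-layout effect (field names and text are made up; assuming the Lucene 4.x/5.x document API and DefaultSimilarity):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public class FieldLayoutExample {
    public static void main(String[] args) {
        // Same overall content, different Field layout.
        Document oneField = new Document();
        oneField.add(new TextField("body", "the quick brown fox jumps", Field.Store.NO));

        Document twoFields = new Document();
        twoFields.add(new TextField("title", "the quick", Field.Store.NO));
        twoFields.add(new TextField("body", "brown fox jumps", Field.Store.NO));

        // Once both are indexed, a query for body:fox matches both documents, but the
        // second one's "body" field holds 3 terms instead of 5, so its length norm
        // (≈ 1 / sqrt(numTerms)) is larger and it scores higher, even though the
        // combined content of the two documents is identical.
    }
}
```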

Lucene allows influencing search results by "boosting" at more than one level:

Document level boosting - while indexing - by calling document.setBoost() before adding the document to the index.

Document's Field level boosting - while indexing - by calling field.setBoost() before adding the field to the document (and before adding the document to the index).

Query level boosting - during search, by setting a boost on a query clause, calling Query.setBoost().

Indexing time boosts are preprocessed for storage efficiency and written to the directory, when the document is written, in a single byte! For each field of a document, all boosts of that field are multiplied together; the result is multiplied by the boost of the document, and also multiplied by a "field length norm" value that represents the length of that field in that doc, so shorter fields are automatically boosted up.
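A small sketch of the three boost levels, using the older Lucene 3.x API where these setters exist (index-time document and field boosts were removed in later versions); field names and values are made up:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

public class BoostLevelsExample {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.setBoost(1.5f);                                   // document-level boost (index time)

        Field title = new Field("title", "Lucene scoring",
                Field.Store.YES, Field.Index.ANALYZED);
        title.setBoost(2.0f);                                 // field-level boost (index time)
        doc.add(title);

        TermQuery query = new TermQuery(new Term("title", "scoring"));
        query.setBoost(3.0f);                                 // query-level boost (search time)
    }
}
```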

The result is encoded as a single byte (with some precision loss, of course) and stored in the directory. The similarity object in effect at indexing time computes the length norm of the field. Encoding and decoding of the resulting float norm in a single byte are done by the static methods of the class Similarity: encodeNorm and decodeNorm. At search time, this norm is brought into the score of the document as norm(t, d), as shown by the formula in Similarity.
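A tiny sketch of that round trip, again against the older Lucene API where these static methods exist on Similarity; the field length of 12 terms is just an example:

```java
import org.apache.lucene.search.Similarity;

public class NormEncodingExample {
    public static void main(String[] args) {
        float lengthNorm = (float) (1.0 / Math.sqrt(12));   // field with 12 terms
        byte stored = Similarity.encodeNorm(lengthNorm);    // what gets written to the index
        float decoded = Similarity.decodeNorm(stored);      // close to, but not exactly, lengthNorm
        System.out.println(lengthNorm + " -> " + decoded);
    }
}
```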

This scoring formula is described in the Similarity class. Please take the time to study this formula, as it contains much of the information about how the basics of Lucene scoring work, especially the TermQuery.
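For convenience, this is the practical scoring function as documented for DefaultSimilarity (TFIDFSimilarity in later versions), written in the same notation the Similarity javadoc uses:

```
score(q, d) = coord(q, d) · queryNorm(q)
              · Σ over terms t in q of ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) )

tf(t in d)   = sqrt(frequency of t in d)
idf(t)       = 1 + ln( numDocs / (docFreq + 1) )
coord(q, d)  = fraction of the query's terms found in d
queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)        (makes scores comparable across queries)
norm(t, d)   = index-time boosts · lengthNorm       (the single-byte value described above)
```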

OK, so the tf-idf formula and the Similarity class are great for understanding the basics of Lucene scoring, but what really drives Lucene scoring are the use of, and interactions between, the Query classes, as created by each application in response to a user's information need. In this regard, Lucene offers a wide variety of Query implementations, most of which are in the org.apache.lucene.search package.
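A few of those Query implementations, combined the way an application typically builds them (pre-6.x Lucene API, where BooleanQuery and PhraseQuery are still mutable; field names and terms are made up):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class QueryExamples {
    public static void main(String[] args) {
        TermQuery exact = new TermQuery(new Term("title", "lucene"));

        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("body", "scoring"));
        phrase.add(new Term("body", "formula"));

        BooleanQuery combined = new BooleanQuery();
        combined.add(exact, BooleanClause.Occur.MUST);    // must match title:lucene
        combined.add(phrase, BooleanClause.Occur.SHOULD); // phrase match raises the score
    }
}
```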


