java.lang.Object
  org.apache.lucene.search.Similarity
All Implemented Interfaces:
Serializable
Direct Known Subclasses:
SimilarityDelegator, DefaultSimilarity, SimpleSimilarity, SweetSpotSimilarity
Subclasses implement search scoring.
The score of query q for document d correlates to the
cosine-distance or dot-product between document and query vectors in a
Vector Space Model (VSM) of Information Retrieval.
A document whose vector is closer to the query vector in that model is scored higher.
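As a rough illustration of the vector-space idea, the sketch below compares a query vector with two document vectors using cosine similarity. The class name, method names, and the dense `double[]` representation are hypothetical choices for this example and are not part of the Lucene API.

```java
// Minimal sketch of vector-space matching; not Lucene code.
final class VsmCosineSketch {
    /** Dot product of two term-weight vectors defined over the same term space. */
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must share one term space");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Cosine of the angle between two vectors; returns 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double normA = Math.sqrt(dot(a, a));
        double normB = Math.sqrt(dot(b, b));
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot(a, b) / (normA * normB);
    }

    public static void main(String[] args) {
        double[] query = {1.0, 1.0, 0.0};
        double[] docA = {2.0, 2.0, 0.0}; // same direction as the query
        double[] docB = {0.0, 1.0, 3.0}; // shares only one term with the query
        System.out.println("docA: " + cosine(query, docA));
        System.out.println("docB: " + cosine(query, docB));
    }
}
```

Here `docA` points in the same direction as the query and would rank above `docB`, which overlaps with the query in a single term only.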
The score is computed as follows:
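$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \text{ in } q} \left( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \right)$$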
where
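$$\mathrm{tf}(t \text{ in } d) = \mathrm{frequency}^{1/2}$$

$$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$

$$\mathrm{queryNorm}(q) = \mathrm{queryNorm}(\mathrm{sumOfSquaredWeights}) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}}$$

$$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \text{ in } q} \left( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \right)^2$$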
When a document is added to the index, the document boost, the field boosts, and the length norm are multiplied together into a single norm value.
If the document has multiple fields with the same name, all their boosts are multiplied together:
$$\mathrm{norm}(t,d) = \mathrm{doc.getBoost}() \cdot \mathrm{lengthNorm}(\mathrm{field}) \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} f.\mathrm{getBoost}()$$
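The individual factors above can be sketched in plain Java as follows. This is an illustrative, standalone sketch rather than the library's code: the class name and the array-based signatures are hypothetical. The argument of the logarithm in `idf` (numDocs / (docFreq + 1)) and the form of `queryNorm` (1 / sqrt(sumOfSquaredWeights)) are assumed here to follow the default implementation.

```java
// Standalone sketch of the scoring factors; names and signatures are illustrative only.
final class ScoringFactorsSketch {
    /** tf: square root of the term frequency within the document. */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** idf: assumed to be 1 + log(numDocs / (docFreq + 1)). */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** q.getBoost()^2 times the sum over query terms of (idf(t) * t.getBoost())^2. */
    static float sumOfSquaredWeights(float queryBoost, float[] idfs, float[] termBoosts) {
        float sum = 0f;
        for (int i = 0; i < idfs.length; i++) {
            float weight = idfs[i] * termBoosts[i];
            sum += weight * weight;
        }
        return queryBoost * queryBoost * sum;
    }

    /** queryNorm: assumed to be 1 / sqrt(sumOfSquaredWeights). */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    /** norm: document boost times length norm times the product of same-named field boosts. */
    static float norm(float docBoost, float lengthNorm, float[] fieldBoosts) {
        float product = docBoost * lengthNorm;
        for (float boost : fieldBoosts) {
            product *= boost;
        }
        return product;
    }

    public static void main(String[] args) {
        float ssw = sumOfSquaredWeights(2f, new float[] {1f, 2f}, new float[] {1f, 0.5f});
        System.out.println("sumOfSquaredWeights = " + ssw);
        System.out.println("queryNorm = " + queryNorm(ssw));
        System.out.println("norm = " + norm(2f, 0.5f, new float[] {1.5f, 2f}));
    }
}
```

Because `queryNorm` depends only on the query, it scales every document's score by the same amount and leaves the ranking for that query unchanged.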
| Methods from org.apache.lucene.search.Similarity (summary): |
|---|
| coord, decodeNorm, encodeNorm, getDefault, getNormDecoder, idf, idf, idf, lengthNorm, queryNorm, scorePayload, setDefault, sloppyFreq, tf, tf |
| Methods from java.lang.Object: |
|---|
| equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Methods from org.apache.lucene.search.Similarity (detail): |
|---|
The presence of a large portion of the query terms indicates a better match with the query, so implementations of this method usually return larger values when the ratio of matched query terms to total query terms is large, and smaller values when that ratio is small. |
|
The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7x10^9 to 2x10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value. |
This is initially an instance of DefaultSimilarity. |
|
The default implementation is: return idf(searcher.docFreq(term), searcher.maxDoc()); Note that Searcher#maxDoc() is used instead of org.apache.lucene.index.IndexReader#numDocs() because it is proportional to Searcher#docFreq(Term), i.e., when one is inaccurate, so is the other, and in the same direction. |
The default implementation sums the idf(Term, Searcher) factor for each term in the phrase. |
Terms that occur in fewer documents are better indicators of topic, so implementations of this method usually return larger values for rare terms, and smaller values for common terms. |
Matches in longer fields are less precise, so implementations of this method usually return smaller values when the field contains many terms, and larger values when it contains few. These values are computed under org.apache.lucene.index.IndexWriter#addDocument(org.apache.lucene.document.Document) and then stored using encodeNorm(float). Thus they have limited precision, and documents must be re-indexed if this method is altered. |
This does not affect ranking, but rather just attempts to make scores from different queries comparable. |
The default implementation returns 1. |
|
A phrase match with a small edit distance to a document passage more closely matches the document, so implementations of this method usually return larger values when the edit distance is small and smaller values when it is large. |
Terms and phrases repeated in a document indicate the topic of the document, so implementations of this method usually return larger values when the term frequency is large, and smaller values when it is small. The default implementation calls tf(float). |
Terms and phrases repeated in a document indicate the topic of the document, so implementations of this method usually return larger values when the term frequency is large, and smaller values when it is small. |
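The single-byte norm encoding described above (small mantissa, rebased exponent, zero and out-of-range handling) can be sketched as follows. This is one possible realisation under stated assumptions, not the library's source: the class and method names are hypothetical, and the bit layout (keep the top bits of the IEEE-754 representation, then rebase the exponent so the result fits in eight bits) is an assumption chosen to reproduce the rounding rules listed in the description.

```java
// Hypothetical sketch of a lossy float-to-byte norm encoding; not the library's source.
final class NormByteSketch {
    // Offset that rebases the float exponent so the kept bits fit into one byte.
    private static final int ZERO_POINT = (63 - 15) << 3;

    /** Encodes a float into one byte, losing precision. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3); // keep sign, exponent, and the top mantissa bits
        if (small <= ZERO_POINT) {
            // Zero and negatives map to 0; tiny positives map to the smallest positive code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1; // too large: clamp to the largest representable code (0xFF)
        }
        return (byte) (small - ZERO_POINT);
    }

    /** Decodes a byte produced by encode back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] samples = {0f, -3f, 0.3f, 0.5f, 1f, 1e20f};
        for (float value : samples) {
            byte code = encode(value);
            System.out.println(value + " -> " + (code & 0xff) + " -> " + decode(code));
        }
    }
}
```

Running `main` prints the byte code and the decoded value for each sample, which makes the precision loss visible. This loss is the reason documents must be re-indexed when the length normalization changes: only the encoded byte is stored in the index, not the original float.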