PAGE Ranking BY SEARCH
ENGINES
Suvendra Kumar Jayasingh
Lecturer in Computer Science
I.M.I.T., Cuttack
Biju Patnaik University of Technology, Odisha
09861016676

Today search engines are becoming best friends of most
of the people for navigation on Internet. User needs to enter keyword or
combination of keywords to trigger the search. Search engine searches the web
pages available on Internet and returns the result as number of ordered pages.
This order should depend on relevancy of pages and importance of pages.
Different search engines use different techniques to
decide importance of page on the web. Alta Vista uses HITS (Hyperlink Induced
Topic Search) algorithm to rank pages while Google uses PageRank algorithm to
rank the pages.
Yahoo also uses
some variation of PageRank algorithm that Google uses. There are many
variations of Page Ranking algorithm. Like confidence based page ranking,
weighted page rank algorithm, query sensitive self-adaptable web page ranking
algorithm etc. But, this paper covers only one algorithm that is based on
formula given by Bin and Page, the founders of Google.

2. What is PageRank?

One
of the most important factors that Google uses is PageRank. PageRank is a numeric value that represents how important a
page is on the web. Off course PageRank is not the only factor, which decides
importance of page, but still it is one of them. PageRank is described by one
mathematical formula that seems very difficult at first, but actually it
is not.
2.1 Formula:
The citation graph of the web is
main resource for calculation of PageRank. In the paper “The Anatomy of a
Large-Scale Hypertextual Web Search Engine” founders of Google, Sergey Brin and
Lawrence Page defined PageRank as:
“We
assume page A has pages T1...Tn which point to it (i.e., are citations). The
parameter d is a damping factor, which can be set between 0 and 1. We usually
set d to 0.85 .……. C(A) is defined as
the number of links going out of page A. The PageRank of a page A is given as
follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) +
... + PR(Tn)/C(Tn)) “
It's the original formula that was
published when PageRank was being developed, and it is probable that Google
uses a variation of it but they aren't telling us what it is. Thus, when one page links to other page, first page
votes some PageRank to the second page. PageRank of a page is the addition of
one constant (1-d=0.15) and the damped value of addition of votes by all pages
pointing to it. Value of vote by a particular page depends on PageRank and
total number of out links of that page. Thus, higher the PageRank, higher is
the value of vote and higher the number of out links, lower is the value of
vote.
So every page distributes 85% of its original PageRank
evenly among all pages to which it points. e.g. if page A is pointing to four
other pages then factor ‘d * PR(A)/4’ will come in PageRank equation of all
those four pages. That four factors will add up to d * PR(A) , i.e. 85% of
PageRank of A.( From here onwards I’ll refer PageRank as PR) Note that when page votes for another page,
it doesn’t give anything from its own PR. Its just voting, only the difference
is that weight of vote of a page depends on its own PR. It is same as shareholders
meeting where weight of vote of shareholder depends on the shares held.
2.2 How to use formula?
If we look at the equation, it’s obvious that PR of a
page depends on PR of other pages, which are pointing to it, which may depend
on PR of original page if it has back link to that page. e.g. If there are only
two pages A and B which points to each other, then PR of each page will depend
on PR of other, which is yet uncalculated. So where to start from? Surprisingly, we can start with any assumed
values of PR. And repeat the calculation of PR values of all pages iteratively
till they become stable.
Lets see an example: consider there
are only two pages A and B, which points to each other. We don’t know what
their PR should be to begin with, so let’s take a guess at 1 and do some calculations.
PR(A)
= (1-d) + d * PR(B)
= (1-0.85) + 0.85 * 1
=
1
PR(B) = (1-d) + d * PR(A)
=
(1-0.85) + 0.85 * 1
=
1
Since
they are not changing we can stop here. Now lets do the calculations with
initial guess at 0:
PR(A) = 0.15 + 0.85 * 0
= 0.15
PR(B) = 0.15 + 0.85 * 0.15
= 0.2775
After
second iteration:
PR(A) = 0.15 + 0.85 * 0.2775
= 0.385875
PR(B)
= 0.15 + 0.85 + 0.385875
= 0.47799375
After
third iteration:
PR(A)
= 0.15 + 0.85 * 0.47799375
= 0.5562946875
PR(B) = 0.15
+ 0.85 * 0.5562946875
= 0.622850484375
and
so on. They will keep changing till they reach value of 1. If we start with
values greater than 1, then also values will degrade iteration by iteration and
will settle down to 1. If we start calculation with different values of PR(A)
and PR(B), then also we will end up with PR(1) = PR(B) = 1. Here we considered
symmetrical link structure between A and B, thus with any value of PR(A) and
PR(B) to start, we ends up with PR(A) = PR(B) = 1. If we take asymmetrical
structure e.g. only A is pointing to B and B is not pointing to A, then we will
end up with consistent values of PR(A) and PR(B), whatever the values we start
with.
For
more practice with different numbers of pages with different configurations of
citation graphs, visit:
www.webworkshop.net/pagerank_calculator.php?lnks=2,10,15&iblprs=0.15,0.15,0.15,0.15&pgnms=&pgs=2&initpr=1&its=100&type=simple
3. How is PageRank used?
Google uses PR as one of the important factor in
search process to find the relevancy of page. So while searching, web pages are
accessed in the decreasing order of PR. One can find PR of its own webpage by
using Google toolbar (http://toolbar.google.com/).
But output of this toolbar is somewhat unexpected. As our normal PR calculation
can yield PR ranging from 0.15 to unlimited. But toolbar gives PR of any page
in the range 1 to 10. So actual PR values are divided into intervals and are
represented as one of the value ranging from 1 to 10. But the question is
whether these intervals are equidistant? Means is the scale linear? No one
outside Google knows it. For this question, there are different answers from
different researchers. According to Ian Rogers, intervals cannot be
equidistant. Means the scale used by Google must be logarithmic. This yields a
result that efforts needed to increase PR from 2 to 3 are very less as compared
to increase PR from 8 to 9. What these efforts are? At the end of this paper
anyone can answer this question.
4. Examples and Observations:
Google does not provide any
information about methods to improve PR values of page or site. But we can
solve different examples and come up with different observations. Before
starting examples I would like to explain difference between PR of page and PR
of site. Google assigns PR for each page on the web. But from the site point of
view, total PR of all pages of site is important. Thus PR of site is nothing
but sum of PRs of all pages of site.
Example
1.


= 0.15
PR(B)
= (1-d) + d (0)
=
0.15
In this example A and B don’t have any inbound and
outbound link. They come up with PR values equal to 0.15. Thus we can conclude
that: “Minimum value of PR of any page is 0.15.”
Example
2.



=
1
PR
(B) = (1-d) + d (0)
=
0.15
Actually this is not the correct result. There are two
concepts: orphan pages and dangling links. Orphan
pages are those, which don’t have any inbound link. For a page to be indexed by
Google it must have at least one page linking to it. If a page is not in the
Google index, it and its links don't exist as far as the calculations are
concerned. Therefore, it can't have a PR value and it can't share PR with other
pages. Page B is orphan page and thus will be eliminated during PR calculation.
And thus page A will not receive any vote from page B.
Dangling links are links that go to pages
that don't have any outbound links. These links are dropped for the duration of
the calculations. Link from B to A is a dangling link and will be eliminated
during PR calculation. But here onwards we will consider only small part of
whole web. Thus even there are no inbound links shown for a particular page,
they are assumed to be present there. Thus we will not consider these two
concepts anymore in our examples.
From here onwards I’ll not show all
calculations. Instead I’ll show only final settled values of PR after large
number of iterations.
Example 3
![]() |
|||
![]() |
|||
Example
4.
![]() |



This
is hierarchical structure. In general homepage of website should have maximum
PR. We need our PR to be concentrated at homepage. Thus we can use hierarchical
structure and can channel large proportion of PR of site to where we want.
Example
5:
![]() |
Example
6:
![]() |
Compare
example 5 with example 6. Page A, B and C are pages of same site. External site
1 is pointing to page A and page C is pointing to page C. In both of these examples, PR of external
site 1 is assumed to be equal to 1. And then PR values of remaining pages are
calculated. Can we infer something from these examples? Definitely. There is PR
leak in example 5, which is avoided in example 6. In example 5 page is C gives
its entire vote to external site. While in example 6 page C gives part of it to
external site and part of it to page A. This part of vote from C increases PR
value of A and thus increases weight of vote of A. In turn PR of B and C
increase which, in turn again increases weight of vote of C. This is iterative
process, which finally results into increased value of PR of all A, B and C.
This is definitely in the favor of PR of whole site.
From these examples we can conclude
that if any page of site is pointing to external page, then we can reduce the
PR leak by increasing citation network.
Can
we decrease PR leak by introducing reciprocating links between B and C? Lets
try this.










|
![]() |
That’s great!!! Average PR of site increased as we
expected.
5. How to increase PageRank?
There are different ways to increase
PR of your site. From examples illustrated above we can come up with some ideas
to increase PR of web site. Some standard ideas are given below:
5.1 Add spam pages:
As shown in figure, we can increase total PR of
website by adding more pages into site. No matter, what are the contents of
these pages. Total PR of site increases, but we can’t increase average PR of
site (i.e. PR per page). Upper limit on average PR of site is 1.
![]() |
5.2
Join forums:
Forums
are a great way to achieve links to your website. In most forums you are
allowed to have a signature and in your signature you can put a link to your
website. But another important note to look on is making sure the forum is
somewhat related to your website. You will still get credit if it's not, but if
it's related to your website then you will be accomplishing two tasks at once.
From this people will come to know about your site and this will help to
increase popularity of your site.
5.3
Submit to search engine directories.
Search engine directories are a good way to get a free
link to your website. They also increase your chances at being listed higher on
popular search engines like Google, and others. Most search engine directories
allow you to submit to their website for free. This will allow you to increase
your web presence by being listed on another search engine, and it will also be
a free link. Remember the more links you have the higher your PR will be.
5.4 Reciprocating Links:
You can search for sites related to same topic as that
of your website. If PR of that site is higher than your site, then you can pay
that site and can create reciprocating links among two. There are link-farming
sites, which exchanges links with other sites. Actually this is illegal. Google
is banning the sites participating in link farming. Thus be aware of link
farming!!
5.5 Contents:
Last and the most important way to increase PR of your
site is to keep solid contents on your sites. Other sites will automatically
link your site if you are having good contents. Really there is no substitute
for good contents!!
6. Conclusion:
Even
though formula for calculating PageRank seems to be difficult, it is easy to
understand. But when a simple calculation is applied hundreds of times over the
results can seem complicated. And we cannot predict the result of these
iterations. Surely, more practice can yield more observations.
PageRank is
important factor considered in Google ranking, but it is only one of the
important factors considered. e.g. now a days Google is paying a lot of
attention to the link’s anchor text while deciding relevancy of target page.
But as PageRank
is also one of the important factor, one should be well aware of PageRank while
designing the website.
7.References:
[1] Altman,
Alon; Moshe Tennenholtz (2005). "Ranking
Systems: The PageRank Axioms" (PDF). Proceedings of the 6th ACM conference on Electronic
commerce (EC-05). Vancouver, BC. http://stanford.edu/~epsalon/pagerank.pdf. Retrieved 2008-02-05.
[2] Haveliwala,
Taher; Jeh, Glen and Kamvar, Sepandar (2003).
"An Analytical Comparison of Approaches to
Personalizing PageRank" (PDF). Stanford University Technical Report. http://www-cs-students.stanford.edu/~taherh/papers/comparison.pdf.
[3] Langville,
Amy N.; Meyer, Carl D. (2003). "Survey: Deeper Inside PageRank". Internet Mathematics 1
(3).
[4] Langville,
Amy N.; Meyer, Carl D. (2006). Google's
PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press. ISBN 0-691-12202-4.
[5] Richardson,
Matthew; Domingos, Pedro (2002). "The intelligent surfer: Probabilistic
combination of link and content information in PageRank" (PDF).
Proceedings of Advances in
Neural Information Processing Systems
http://www.iprcom.com/papers/pagerank/
http://www-db.stanford.edu/~backrub/google.html
No comments:
Post a Comment