Website Recursive Url Parser

Posted By: Jean Paul    Posted Date: March 26, 2011    Category: C#    URL: http://www.dotnetspark.com

In this article I share a piece of code that may be useful to other developers.
 

We can find a lot of C# code that parses HTTP URLs in a given string, but it is difficult to find code that will:

  • Accept a URL as an argument and parse the site content
  • Fetch all URLs in the site content, then parse the content of each of those URLs
  • Repeat the above process until all URLs are fetched

Scenario

Taking the website http://valuestocks.in (a stock market site) as an example, I would like to get all the URLs inside the website recursively.

Design 

The main class is SpiderLogic, which contains all the necessary methods and properties.

  

The GetUrls() method parses the website and returns the URLs. There are two overloads of this method.

 

The first overload takes two arguments: the URL, and a Boolean indicating whether recursive parsing is needed.

 

Eg: GetUrls("http://www.google.com", true);

 

The second overload takes three arguments: the URL, the base URL, and the recursive Boolean.

This overload is intended for cases where the URL is a sub-level of the base URL and the web page contains relative paths. The base URL argument is needed to construct valid absolute URLs from those relative paths.

 

Eg: GetUrls("http://www.whereincity.com/india-kids/baby-names/", "http://www.whereincity.com/", true);

 

Method Body of GetUrls()

 

public IList<string> GetUrls(string url, string baseUrl, bool recursive)
{
    if (recursive)
    {
        _urls.Clear();
        RecursivelyGenerateUrls(url, baseUrl);
        return _urls;
    }
    else
        return InternalGetUrls(url, baseUrl);
}
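The body of RecursivelyGenerateUrls() is not shown above, so here is a minimal Python sketch of the same recursive crawl idea. The site is stubbed out as a dictionary mapping each page URL to the URLs found in its content; the names and the SITE data are illustrative, not the article's code.

```python
# Stub of a website: each page URL maps to the URLs its content contains.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}

def get_urls(url, recursive, visited=None):
    """Return all URLs found on `url`; descend into each one if `recursive`."""
    if visited is None:
        visited = set()
    found = []
    for link in SITE.get(url, []):
        if link not in visited:
            visited.add(link)          # avoid re-parsing the same page twice
            found.append(link)
            if recursive:
                found.extend(get_urls(link, True, visited))
    return found

print(get_urls("http://example.com/", recursive=True))
```

The `visited` set is the important detail: without it, pages that link back to each other would recurse forever.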

 

InternalGetUrls() 

Another method of interest is InternalGetUrls(), which fetches the content of the URL, parses the URLs inside it, and constructs absolute URLs.

 

private IList<string> InternalGetUrls(string baseUrl, string absoluteBaseUrl)
{
    IList<string> list = new List<string>();

    Uri uri = null;
    if (!Uri.TryCreate(baseUrl, UriKind.RelativeOrAbsolute, out uri))
        return list;

    // Get the http content
    string siteContent = GetHttpResponse(baseUrl);

    var allUrls = GetAllUrls(siteContent);

    foreach (string uriString in allUrls)
    {
        uri = null;
        if (Uri.TryCreate(uriString, UriKind.RelativeOrAbsolute, out uri))
        {
            if (uri.IsAbsoluteUri)
            {
                // Exclude this check if different domains / javascript: urls are needed
                if (uri.OriginalString.StartsWith(absoluteBaseUrl))
                {
                    list.Add(uriString);
                }
            }
            else
            {
                string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);
                if (!string.IsNullOrEmpty(newUri))
                    list.Add(newUri);
            }
        }
        else
        {
            if (!uriString.StartsWith(absoluteBaseUrl))
            {
                string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);
                if (!string.IsNullOrEmpty(newUri))
                    list.Add(newUri);
            }
        }
    }

    return list;
}
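The absolute/relative split above can be illustrated with Python's standard urllib.parse. This is a sketch of the same filtering logic, not a translation of the article's code; the `resolve` name is mine.

```python
from urllib.parse import urljoin, urlparse

def resolve(uri_string, absolute_base_url):
    """Keep absolute URLs only if they belong to the base site;
    resolve relative URLs against the base to make them absolute."""
    if urlparse(uri_string).scheme:   # has a scheme -> already absolute
        return uri_string if uri_string.startswith(absolute_base_url) else None
    return urljoin(absolute_base_url, uri_string)   # relative -> absolute

print(resolve("/baby-names/", "http://www.whereincity.com/"))
print(resolve("http://other-domain.com/x", "http://www.whereincity.com/"))
```

Dropping URLs from other domains keeps the crawl confined to one site; removing that check (as the comment in the C# code notes) would let javascript: and cross-domain links through.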

 

Handling Exceptions 

There is an OnException delegate that can be used to receive the exceptions that occur while parsing.
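A C# delegate used this way is simply a subscriber-settable callback invoked from the parser's catch blocks. The article does not show the delegate's signature, so the Python sketch below is an illustration of the pattern with invented names.

```python
class SpiderLogic:
    """Minimal illustration of the OnException callback pattern."""

    def __init__(self):
        self.on_exception = None        # caller may assign a callback here

    def _fetch(self, url):
        raise IOError("host unreachable")   # simulated network failure

    def get_urls(self, url):
        try:
            return self._fetch(url)
        except Exception as exc:
            if self.on_exception:
                self.on_exception(exc)  # report the error instead of crashing
            return []                   # parsing continues with an empty result

errors = []
spider = SpiderLogic()
spider.on_exception = errors.append     # subscribe to exceptions
result = spider.get_urls("http://example.com/")
print(result, len(errors))
```

This design lets the crawl keep going past individual broken pages while still surfacing every failure to the caller.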

                    

Tester Application 

A tester Windows application is included with the source code of this article; you can try executing it.

The form accepts a base URL as input. Clicking the Go button parses the content of that URL and extracts all the URLs in it. If you need recursive parsing, check the Is Recursive check box.

 

Next Part 

In the next part of this article, I would like to create a URL verifier that checks all the URLs in a website. I agree that a quick search will turn up free services that do this; my aim is to learn and to develop custom code that is extensible and reusable by the community across multiple projects.

