Is there a way for a program to know that two URLs point to the same page if the URLs are slightly different?

Ely asked:


I have an application that pulls in two RSS feeds and eliminates the duplicate entries. Many duplicate RSS entries have different text but point to the same landing page. The application compares the URLs of the landing pages and, if they are identical, removes one.

Some landing pages append text to the end of the URL to identify the source (e.g., …src=site1 or …source=site2).

A real-world example would be these two links:
(Yahoo shortens the links, so you might have to click on them to see the end of the URLs.)

URL 1: http://www.computerjobs.com/job_display.aspx?jobid=2506539
URL 2: http://www.computerjobs.com/job_display.aspx?jobid=2506539&utm_source=job_site&utm_medium=organic&utm_campaign=job_site

Is there a way I can program my application to know that these two links point to the same page? It has to work for other sites as well, not just this one, which appends “&utm_source=job_site&utm_medium=organic&utm_campaign=job_site” to its URLs to identify the source.

Thanks for the help!


2 Responses to “Is there a way for a program to know that two URLs point to the same page if the URLs are slightly different?”

  1. Make Money Blogging Says:

    At what level of detail do you mean “same page”? Is it just the job_display.aspx page, or does it include the job ID (the ?jobid=2506539 part)?

    If only the page, then parse the URL up to the question mark and compare that part.

    If you want it to include the job ID, then parse the URL up to the first &, if one exists. Everything after it is the additional arguments passed to the page along with the job ID.

    Does any of this help?

  2. Ecommerce Marketing Says:

    Essentially, no. All a comparison can do is check the characters. A page can only be linked to directly from the Web by a single string of characters (URL). It can, however, be linked to by another URL redirecting to it, but that’s not detectable from your end.

    Sorry.
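
The idea in the first response (compare the URLs after dropping the extra arguments) could be sketched roughly as below. This is only a minimal illustration in Python, not a general solution: the names `TRACKING_PARAMS`, `TRACKING_PREFIXES`, `normalize_url`, and `same_page` are made up for the example, and the list of parameters treated as "source" markers is an assumption based on the ones mentioned in the question (`src`, `source`, `utm_*`).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these parameter names only carry tracking/source information.
# The list is illustrative, not exhaustive; extend it for the feeds you consume.
TRACKING_PARAMS = {"src", "source"}
TRACKING_PREFIXES = ("utm_",)


def is_tracking(name):
    """Return True if a query parameter name looks like a tracking marker."""
    name = name.lower()
    return name in TRACKING_PARAMS or name.startswith(TRACKING_PREFIXES)


def normalize_url(url):
    """Return a comparable form of url with tracking parameters removed."""
    parts = urlsplit(url.strip())
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not is_tracking(k)]
    kept.sort()  # make the comparison independent of parameter order
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,
        urlencode(kept),
        "",  # drop the fragment; it is not sent to the server
    ))


def same_page(url_a, url_b):
    """Compare two URLs after normalization."""
    return normalize_url(url_a) == normalize_url(url_b)


url1 = "http://www.computerjobs.com/job_display.aspx?jobid=2506539"
url2 = ("http://www.computerjobs.com/job_display.aspx?jobid=2506539"
        "&utm_source=job_site&utm_medium=organic&utm_campaign=job_site")
print(same_page(url1, url2))  # prints True
```

Unlike cutting the URL at the first &, this keeps every parameter that is not on the tracking list, so it does not depend on the job ID coming first. It is still a heuristic: a site could use `source` as a meaningful parameter, and, as the second response points out, two different strings can lead to the same page in ways a string comparison cannot see.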
