Normalize HTTP URI

I am getting URIs from Akamai log files that include entries like:

/foo/jim/jam
/foo/jim/jam?
/foo/./jim/jam
/foo/bar/../jim/jam
/foo/jim/jam?autho=<randomstring>&file=jam

      

I would like to normalize them all to a single entry, according to these rules:

  • If there is a query string, strip autho and file from it.
  • If the query string is empty, remove the trailing ?.
  • Directory entries of ./ should be removed.
  • Directory entries of <fulldir>/../ should be removed.

I would think the URI library for Ruby would cover this, but:

  • It does not provide any mechanism for parsing parts of a query string. (Not that it's hard to do, nor is it standard.)
  • It does not remove the trailing ? if the query string is emptied.

    URI.parse('/foo?jim').tap{ |u| u.query='' }.to_s #=> "/foo?"
    
          

  • The normalize method does not clean up . or .. in the path.

So, lacking an official library, I find myself writing a regex-based solution.

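def normalize(path)
  result = path.dup  # work on a copy so the caller's string is not mutated
  # Drop the autho and file parameters from the query string
  result.sub!(/(?<=\?).+$/) do |query|
    query.split('&').reject do |kv|
      %w[ autho file ].include?(kv[/^[^=]+/])
    end.join('&')
  end
  # Remove the trailing ? left behind by an empty query string
  result.sub!(/\?$/, '')
  # Collapse <dir>/../ and /./ in the part before the query string
  result.sub!(/^[^?]+/){ |dirs| dirs.gsub(%r{[^/]+/\.\./},'').gsub('/./','/') }
  result
end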

      

It happens to work for the test cases I've listed above, but with 450,000 paths to clean up, I can't check them all by hand.

  • Is there any glaring hole in the above, given the likely log file entries?
  • Is there a better way to accomplish the same thing that relies on proven parsing techniques instead of my hand-rolled regex?
+3




2 answers


The addressable gem will normalize them for you:



require 'addressable/uri'

# normalize relative paths
uri = Addressable::URI.parse('http://example.com/foo/bar/../jim/jam')
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam"

# removes trailing ?
uri = Addressable::URI.parse('http://example.com/foo/jim/jam?')
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam"

# leaves empty parameters alone
uri = Addressable::URI.parse('http://example.com/foo/jim/jam?jim')
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam?jim"

# remove specific query parameters
uri = Addressable::URI.parse('http://example.com/foo/jim/jam?autho=<randomstring>&file=jam')
cleaned_query = uri.query_values
cleaned_query.delete('autho')
cleaned_query.delete('file')
uri.query_values = cleaned_query
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam"

      

+5




Something that is REALLY important, like ESSENTIAL to remember, is that a URL/URI is a protocol, a host, and a file path to a resource, followed by the options/parameters being passed to the resource being referenced. (For the pedantic, there are other, optional things in there too, but this is sufficient.)
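As a rough illustration of that breakdown, the sketch below parses a made-up URL with Ruby's standard URI class and prints each piece. The host and the parameter values are assumptions chosen for the example, not entries from the actual logs.

```ruby
require 'uri'

# Hypothetical sample URL, used only to show how the pieces separate.
uri = URI.parse('http://example.com/foo/bar/../jim/jam?autho=abc123&file=jam')

parts = {
  scheme: uri.scheme, # protocol
  host:   uri.host,   # server name
  path:   uri.path,   # file path to the resource, left exactly as written
  query:  uri.query   # parameters passed to the resource
}

parts.each { |name, value| puts "#{name}: #{value}" }
```

Parsing alone leaves the path as it was written, dot segments included, so cleaning it up is a separate step.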

We can extract the path from a URL by parsing it with the URI class and calling its path method. Once we have the path, it is either an absolute path or a relative path based on the root of the site. Dealing with absolute paths is easy:

require 'uri'

%w[
  /foo/jim/jam
  /foo/jim/jam?
  /foo/./jim/jam
  /foo/bar/../jim/jam
  /foo/jim/jam?autho=<randomstring>&file=jam
].each do |url|
  uri = URI.parse(url)
  path = uri.path
  puts File.absolute_path(path)
end
# >> /foo/jim/jam
# >> /foo/jim/jam
# >> /foo/jim/jam
# >> /foo/jim/jam
# >> /foo/jim/jam

      



Since the paths are file paths based on the root of the server, we can play games using Ruby's File.absolute_path method to normalize the "." and ".." away and get a true absolute path. This will break if there are more .. (parent directory) entries than there are directories in the chain, but you shouldn't find that in the extracted paths, since it would also break the server's/browser's ability to serve/request/receive resources.
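Putting that together with the query-string rules from the question, one possible sketch looks like the following. The helper name normalize_log_uri and the DROPPED_PARAMS constant are hypothetical, and the code assumes a POSIX system (on Windows, File.absolute_path would add a drive letter) and inputs that always begin with a slash.

```ruby
require 'uri'

# Parameters to strip, taken from the rules in the question.
DROPPED_PARAMS = %w[autho file].freeze

# Hypothetical helper: cleans one logged path-plus-query string.
def normalize_log_uri(raw)
  uri = URI.parse(raw)

  # File.absolute_path collapses "." and ".." entries in the path.
  clean_path = File.absolute_path(uri.path)

  # Keep every key=value pair whose key is not in DROPPED_PARAMS.
  kept = uri.query.to_s.split('&').reject do |pair|
    pair.empty? || DROPPED_PARAMS.include?(pair[/\A[^=]*/])
  end

  # Only emit "?" when something is left in the query string.
  kept.empty? ? clean_path : "#{clean_path}?#{kept.join('&')}"
end

puts normalize_log_uri('/foo/bar/../jim/jam?autho=abc123&keep=1')
```

One caveat with this approach: File.absolute_path also drops a trailing slash, so a directory URL such as /foo/bar/ comes back as /foo/bar, which may or may not be what you want when counting log entries.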

It gets a little more "interesting" when dealing with relative paths, though File is still our friend there. That's a different question.
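For the relative case, a minimal sketch might pass a base directory as the second argument to File.absolute_path. The base and the relative path below are made-up values, and the example again assumes a POSIX system.

```ruby
# Hypothetical values: the directory the relative path is resolved against,
# and a relative path containing both ".." and "." entries.
base     = '/foo/bar'
relative = '../jim/./jam'

# The second argument is the directory used as the starting point.
resolved = File.absolute_path(relative, base)

puts resolved
```

Which directory to use as the base depends on the page the relative link was found on, which is information the log line alone may not carry.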

+2

