Java: Do I have an efficient regex for excluding characters and renaming a file?

I have a series of link names from which I am trying to strip special characters. From a quick look through the files, my biggest problems seem to be brackets, parentheses and colons. After struggling unsuccessfully with escape characters to SELECT :, [ and (, I decided to instead exclude everything except what I wanted to KEEP in the filename.

Consider:

String foo = inputFilename;   // SAMPLE DATA: [Phone]_Michigan_billing_(automatic).html
String scrubbedFoo = foo.replaceAll("[^a-zA-Z-._]", "");

      

Expected Result: Phone_Michigan_billing_automatic.html

My character escape rule was approaching 60 characters when I scrapped it. The last version I saved before changing strategies was [:.(\\[)|(\\()|(\\))|(\\])], where I thought I was asking for escape-character-[, (, ) and ].

The blanket exclusion seems to work very well. Is regex really that simple? Any input regarding the effectiveness of this strategy? I feel like I am missing something and need a second pair of eyes.

+3




3 answers


In my opinion you are using the wrong tool for this job. StringUtils (from Apache Commons Lang) has a method named replaceChars that replaces every occurrence of the given characters with different ones, or deletes them. Here's the documentation:

public static String replaceChars(String str,
                              String searchChars,
                              String replaceChars)

Replaces multiple characters in a String in one go. This method can also be used to delete characters.

For example:
replaceChars("hello", "ho", "jy") = jelly.

A null string input returns null. An empty ("") string input returns an empty string. A null or empty set of search characters returns the input string.

The length of the search characters should normally equal the length of the replace characters. If the search characters is longer, then the extra search characters are deleted. If the search characters is shorter, then the extra replace characters are ignored.

 StringUtils.replaceChars(null, *, *)           = null
 StringUtils.replaceChars("", *, *)             = ""
 StringUtils.replaceChars("abc", null, *)       = "abc"
 StringUtils.replaceChars("abc", "", *)         = "abc"
 StringUtils.replaceChars("abc", "b", null)     = "ac"
 StringUtils.replaceChars("abc", "b", "")       = "ac"
 StringUtils.replaceChars("abcba", "bc", "yz")  = "ayzya"
 StringUtils.replaceChars("abcba", "bc", "y")   = "ayya"
 StringUtils.replaceChars("abcba", "bc", "yzx") = "ayzya"

      

So in your example:

    String translated = StringUtils.replaceChars("[Phone]_Michigan_billing_(automatic).html", "[]():", null);
    System.out.println(translated);

      



The output will be:

Phone_Michigan_billing_automatic.html

It will be easier to read and understand than any regex you could write.
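If Commons Lang is not on the classpath, the same deletion can be sketched in plain Java. This is not the library's implementation, just a minimal stand-in; the class name CharDeleter and the method deleteChars are hypothetical, and the loop works on char values, so it assumes the unwanted characters are not surrogate pairs.

```java
// Hypothetical helper: a plain-Java stand-in for deleting a fixed set of
// characters, roughly what StringUtils.replaceChars(str, chars, null) is used for above.
final class CharDeleter {

    /** Returns input with every character that appears in unwanted removed. */
    static String deleteChars(String input, String unwanted) {
        // Mirror the null/empty behaviour quoted in the documentation above.
        if (input == null || unwanted == null || unwanted.isEmpty()) {
            return input;
        }
        StringBuilder kept = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (unwanted.indexOf(c) < 0) {   // keep only characters not in the delete set
                kept.append(c);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String name = "[Phone]_Michigan_billing_(automatic).html";
        System.out.println(deleteChars(name, "[]():"));
        // prints Phone_Michigan_billing_automatic.html
    }
}
```

Note that this is still a blacklist: any character you did not think to list passes straight through.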

+1




I think your regex could be as simple as \W, which matches anything that is not a word character (letters, digits and underscores). It is the negation of \w.

So your code will look like this:

String scrubbed = foo.replaceAll("\\W", "");   // the backslash must be doubled inside a Java string literal

      

As pointed out in the comments, the above also removes periods. The following keeps periods as well:



String scrubbed = foo.replaceAll("[^\\w.]", "");

      

Details: remove everything that is not (^ inside a character class) a digit, underscore or letter (\w) or a period (. needs no escaping inside a character class).

As noted above, there may be other characters you want in the whitelist, e.g. a hyphen (-). Just add them to your character class as you go.

String scrubbed = foo.replaceAll("[^\\w.\\-]", "");
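To see the three patterns side by side, here is a small sketch that runs them on the sample name from the question and on a second, made-up name containing a hyphen. The class and method names are hypothetical. Every backslash is doubled because it sits inside a Java string literal, and the result is returned rather than discarded because strings are immutable.

```java
// Hypothetical demo class comparing the three patterns discussed above.
final class WordClassDemo {

    // Removes everything that is not a letter, digit or underscore (periods go too).
    static String stripNonWord(String name) {
        return name.replaceAll("\\W", "");
    }

    // Keeps word characters and periods.
    static String keepWordAndDot(String name) {
        return name.replaceAll("[^\\w.]", "");
    }

    // Keeps word characters, periods and hyphens.
    static String keepWordDotHyphen(String name) {
        return name.replaceAll("[^\\w.\\-]", "");
    }

    public static void main(String[] args) {
        String sample = "[Phone]_Michigan_billing_(automatic).html";
        System.out.println(stripNonWord(sample));     // Phone_Michigan_billing_automatichtml
        System.out.println(keepWordAndDot(sample));   // Phone_Michigan_billing_automatic.html

        String hyphenated = "my-file (v2).txt";       // made-up example name
        System.out.println(keepWordAndDot(hyphenated));    // myfilev2.txt
        System.out.println(keepWordDotHyphen(hyphenated)); // my-filev2.txt
    }
}
```

By default \w in Java covers only ASCII letters, digits and underscore, so accented letters would be stripped as well.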

      

+1




I think your regex is the way to go. In general, whitelisting instead of blacklisting is almost always better (only allowing the characters you KNOW are good, rather than eliminating every character you think is bad). From a security point of view, this regex should be preferred: you will never end up with a filename that contains invalid characters.

suggested regex: [^a-zA-Z-._]
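A minimal sketch of that whitelist wrapped in a reusable helper, assuming you want to compile the pattern once instead of on every replaceAll call. The class name, method names and the second sample filename are made up for illustration. One thing worth noticing: the whitelist as written has no digits, so numbers in a filename are dropped too; the second pattern is an assumed variant that keeps them.

```java
import java.util.regex.Pattern;

// Hypothetical helper around the whitelist regex suggested above.
final class FilenameWhitelist {

    // The suggested whitelist: ASCII letters, hyphen, period and underscore.
    private static final Pattern NOT_ALLOWED = Pattern.compile("[^a-zA-Z-._]");

    // Assumed variant that also keeps digits (not part of the original regex).
    private static final Pattern NOT_ALLOWED_KEEP_DIGITS = Pattern.compile("[^a-zA-Z0-9._-]");

    static String scrub(String name) {
        return NOT_ALLOWED.matcher(name).replaceAll("");
    }

    static String scrubKeepingDigits(String name) {
        return NOT_ALLOWED_KEEP_DIGITS.matcher(name).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(scrub("[Phone]_Michigan_billing_(automatic).html"));
        // Phone_Michigan_billing_automatic.html

        String dated = "report_2024:(final).html";       // made-up example name
        System.out.println(scrub(dated));                // report_final.html
        System.out.println(scrubKeepingDigits(dated));   // report_2024final.html
    }
}
```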

      

+1








